[jira] [Created] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)
vijayant soni created SPARK-26086:
-

 Summary: Spark streaming max records per batch interval
 Key: SPARK-26086
 URL: https://issues.apache.org/jira/browse/SPARK-26086
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.3.1
Reporter: vijayant soni


We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.
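For reference, a minimal Python sketch of the arithmetic above; the variable names are ours, and only the 2000 value corresponds to an actual Spark setting (spark.streaming.receiver.maxRate).

{code:python}
# Expected per-batch ceiling implied by the configuration reported above.
batch_interval_s = 10 * 60       # 10-minute batch interval, in seconds
num_receivers = 5                # number of Kinesis receivers
max_rate_per_receiver = 2000     # spark.streaming.receiver.maxRate (records/sec per receiver)

max_records_per_batch = batch_interval_s * num_receivers * max_rate_per_receiver
print(max_records_per_batch)     # 6000000
{code}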



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vijayant soni updated SPARK-26086:
--
Affects Version/s: (was: 2.3.2)
   2.3.0

> Spark streaming max records per batch interval
> --
>
> Key: SPARK-26086
> URL: https://issues.apache.org/jira/browse/SPARK-26086
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.0
>Reporter: vijayant soni
>Priority: Major
>
> We have a Spark Streaming application that reads from Kinesis and writes to Redshift.
> *Configuration*:
> Number of receivers = 5
> Batch interval = 10 mins
> spark.streaming.receiver.maxRate = 2000 (records per second)
> According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:
> {noformat}
> Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
> 10 * 60 * 5 * 2000 = 6,000,000
> {noformat}
> But the actual number of records per batch is more than this maximum:
> Batch I - 6,005,886 records
> Batch II - 6,001,623 records
> Batch III - 6,010,148 records
> Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.






[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vijayant soni updated SPARK-26086:
--
Affects Version/s: (was: 2.3.1)
   2.3.2

> Spark streaming max records per batch interval
> --
>
> Key: SPARK-26086
> URL: https://issues.apache.org/jira/browse/SPARK-26086
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.2
>Reporter: vijayant soni
>Priority: Major
>
> We have a Spark Streaming application that reads from Kinesis and writes to Redshift.
> *Configuration*:
> Number of receivers = 5
> Batch interval = 10 mins
> spark.streaming.receiver.maxRate = 2000 (records per second)
> According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:
> {noformat}
> Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
> 10 * 60 * 5 * 2000 = 6,000,000
> {noformat}
> But the actual number of records per batch is more than this maximum:
> Batch I - 6,005,886 records
> Batch II - 6,001,623 records
> Batch III - 6,010,148 records
> Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.






[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vijayant soni updated SPARK-26086:
--
Description: 
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.

  was:
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.


> Spark streaming max records per batch interval
> --
>
> Key: SPARK-26086
> URL: https://issues.apache.org/jira/browse/SPARK-26086
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.1
>Reporter: vijayant soni
>Priority: Major
>
> We have a Spark Streaming application that reads from Kinesis and writes to Redshift.
> *Configuration*:
> Number of receivers = 5
> Batch interval = 10 mins
> spark.streaming.receiver.maxRate = 2000 (records per second)
> According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:
> {noformat}
> Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
> 10 * 60 * 5 * 2000 = 6,000,000
> {noformat}
> But the actual number of records per batch is more than this maximum:
> Batch I - 6,005,886 records
> Batch II - 6,001,623 records
> Batch III - 6,010,148 records
> Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.






[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vijayant soni updated SPARK-26086:
--
Description: 
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.

  was:
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.


> Spark streaming max records per batch interval
> --
>
> Key: SPARK-26086
> URL: https://issues.apache.org/jira/browse/SPARK-26086
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.1
>Reporter: vijayant soni
>Priority: Major
>
> We have a Spark Streaming application that reads from Kinesis and writes to Redshift.
> *Configuration*:
> Number of receivers = 5
> Batch interval = 10 mins
> spark.streaming.receiver.maxRate = 2000 (records per second)
> According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:
> {noformat}
> Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
> 10 * 60 * 5 * 2000 = 6,000,000
> {noformat}
> But the actual number of records per batch is more than this maximum:
> Batch I - 6,005,886 records
> Batch II - 6,001,623 records
> Batch III - 6,010,148 records
> Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.






[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval

2018-11-15 Thread vijayant soni (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

vijayant soni updated SPARK-26086:
--
Description: 
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.

  was:
We have a Spark Streaming application that reads from Kinesis and writes to Redshift.

*Configuration*:

Number of receivers = 5

Batch interval = 10 mins

spark.streaming.receiver.maxRate = 2000 (records per second)

According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:

{noformat}
Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
10 * 60 * 5 * 2000 = 6,000,000
{noformat}

But the actual number of records per batch is more than this maximum:

Batch I - 6,005,886 records

Batch II - 6,001,623 records

Batch III - 6,010,148 records

Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.


> Spark streaming max records per batch interval
> --
>
> Key: SPARK-26086
> URL: https://issues.apache.org/jira/browse/SPARK-26086
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.3.1
>Reporter: vijayant soni
>Priority: Major
>
> We have a Spark Streaming application that reads from Kinesis and writes to Redshift.
> *Configuration*:
> Number of receivers = 5
> Batch interval = 10 mins
> spark.streaming.receiver.maxRate = 2000 (records per second)
> According to this configuration, the maximum number of records that can be read in a single batch can be calculated with the formula below:
> {noformat}
> Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver)
> 10 * 60 * 5 * 2000 = 6,000,000
> {noformat}
> But the actual number of records per batch is more than this maximum:
> Batch I - 6,005,886 records
> Batch II - 6,001,623 records
> Batch III - 6,010,148 records
> Please note that the receivers are not even reading at the max rate; the records read per receiver are around 1,900 per second.






[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689056#comment-16689056
 ] 

Ruslan Dautkhanov commented on SPARK-26041:
---

[~hyukjin.kwon] I didn't request an investigation. I hope that creating a jira and explaining how it happens may help somebody else solve their problem too, no?

If you haven't noticed, this jira has a sequence of SQLs, attached as a txt file, that triggers this problem. There are a couple of other jiras, SPARK-13480 and SPARK-12940, that seem relevant but were also closed as "cannot reproduce". I think there is a long-standing problem where Catalyst over-optimizes and excessively prunes some columns from the lineage.

I thought that by reporting problems here we help make Spark better, no? Unfortunately, closing a jira as "cannot reproduce" doesn't make the problem disappear.

Having said that, I will try to make a reproducible case and upload it here, in addition to the SQLs that are already attached.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of the intermediate dataframes as a Parquet table and read it back in, this error doesn't happen (exact same downstream queries).
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception; the reason is that Catalyst cuts some columns out of the dataframes:
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  
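As an illustration only, here is a minimal PySpark sketch of the workaround mentioned in the Environment field (materializing an intermediate dataframe to Parquet and reading it back before the downstream queries); {{intermediate_df}} and the path are hypothetical names, not taken from the attached SQLs.

{code:python}
# Hypothetical workaround sketch: checkpoint the intermediate result to Parquet
# so the downstream group-bys/joins run against a freshly read dataframe.
intermediate_df.write.mode("overwrite").parquet("/tmp/intermediate_checkpoint")
intermediate_df = spark.read.parquet("/tmp/intermediate_checkpoint")
# Subsequent joins / exists / in queries then use the re-read dataframe.
{code}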

[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689052#comment-16689052
 ] 

Ruslan Dautkhanov commented on SPARK-26019:
---

No, it was the only instance of this problem that I had. I will ask the user who ran into it again.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems to be flaky; on the next rerun it didn't happen.
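For illustration, a minimal plain-Python sketch of the failure mode the traceback suggests: an auth token that ends up as {{None}} before {{len()}} is called on it. That the token is {{None}} is our reading of the traceback, and the guard below is an illustration only, not the actual Spark fix.

{code:python}
# If the accumulator server has no auth token, len(None) raises exactly the
# TypeError reported above.
auth_token = None  # assumed state when the error occurs

try:
    received_len = len(auth_token)
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()

# A defensive guard, for illustration only:
if auth_token is not None:
    received_len = len(auth_token)
{code}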






[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-15 Thread Ruslan Dautkhanov (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-26019.
---
Resolution: Cannot Reproduce

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> The error seems to be flaky; on the next rerun it didn't happen.






[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION

2018-11-15 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689040#comment-16689040
 ] 

Wenchen Fan commented on SPARK-26078:
-

looks like a bug when we rewrite correlated subquery, cc [~viirya] [~mgaido] 

> WHERE .. IN fails to filter rows when used in combination with UNION
> 
>
> Key: SPARK-26078
> URL: https://issues.apache.org/jira/browse/SPARK-26078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.0
>Reporter: Arttu Voutilainen
>Priority: Blocker
>  Labels: correctness
>
> Hey,
> We encountered a case where Spark SQL does not seem to handle WHERE .. IN correctly when used in combination with UNION, and instead also returns rows that do not fulfill the condition. Swapping the order of the datasets in the UNION makes the problem go away. Repro below:
>  
> {code}
> sql = SQLContext(sc)
> a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}])
> a.registerTempTable('a')
> b.registerTempTable('b')
> bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'a' as source FROM a
> UNION ALL
> SELECT id, num, 'b' as source FROM b
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> no_bug = sql.sql("""
> SELECT id,num,source FROM
> (
> SELECT id, num, 'b' as source FROM b
> UNION ALL
> SELECT id, num, 'a' as source FROM a
> ) AS c
> WHERE c.id IN (SELECT id FROM b WHERE num = 2)
> """)
> bug.show()
> no_bug.show()
> bug.explain(True)
> no_bug.explain(True)
> {code}
> This results in one extra row in the "bug" DF, coming from DF "b", that should not be there:
> {code:java}
> >>> bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| a|
> |  a|  2| b|
> |  b|  1| b|
> +---+---+--+
> >>> no_bug.show()
> +---+---+--+
> | id|num|source|
> +---+---+--+
> |  a|  2| b|
> |  a|  2| a|
> +---+---+--+
> {code}
>  The reason can be seen in the query plans:
> {code:java}
> >>> bug.explain(True)
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#0, num#1L, a AS source#136]
> :  +- Join LeftSemi, (id#0 = id#4)
> : :- LogicalRDD [id#0, num#1L], false
> : +- Project [id#4]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Join LeftSemi, (id#4#172 = id#4#172)
>:- Project [id#4, num#5L, b AS source#137]
>:  +- LogicalRDD [id#4, num#5L], false
>+- Project [id#4 AS id#4#172]
>   +- Filter (isnotnull(num#5L) && (num#5L = 2))
>  +- LogicalRDD [id#4, num#5L], false
> {code}
> Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition 
> seems wrong, and I believe it causes the LeftSemi to return true for all rows 
> in the left-hand-side table, thus failing to filter as the WHERE .. IN 
> should. Compare with the non-buggy version, where both LeftSemi joins have 
> distinct #-things on both sides:
> {code:java}
> >>> no_bug.explain()
> ...
> == Optimized Logical Plan ==
> Union
> :- Project [id#4, num#5L, b AS source#142]
> :  +- Join LeftSemi, (id#4 = id#4#173)
> : :- LogicalRDD [id#4, num#5L], false
> : +- Project [id#4 AS id#4#173]
> :+- Filter (isnotnull(num#5L) && (num#5L = 2))
> :   +- LogicalRDD [id#4, num#5L], false
> +- Project [id#0, num#1L, a AS source#143]
>+- Join LeftSemi, (id#0 = id#4#173)
>   :- LogicalRDD [id#0, num#1L], false
>   +- Project [id#4 AS id#4#173]
>  +- Filter (isnotnull(num#5L) && (num#5L = 2))
> +- LogicalRDD [id#4, num#5L], false
> {code}
>  
> Best,
> -Arttu 
>  






[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description

2018-11-15 Thread Shivaram Venkataraman (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689038#comment-16689038
 ] 

Shivaram Venkataraman commented on SPARK-24255:
---

This is a great list -- I don't think we are able to handle all of these scenarios? [~kiszk], do you know of any existing library that parses all the version strings?

> Require Java 8 in SparkR description
> 
>
> Key: SPARK-24255
> URL: https://issues.apache.org/jira/browse/SPARK-24255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> CRAN checks require that the Java version be set both in package description 
> and checked during runtime.






[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions

2018-11-15 Thread Wenchen Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688995#comment-16688995
 ] 

Wenchen Fan commented on SPARK-20236:
-

This looks like a bug to me. Can you come up with a simple code snippet to 
reproduce this issue and create a ticket? I'll take a closer look. Thanks!

> Overwrite a partitioned data source table should only overwrite related 
> partitions
> --
>
> Key: SPARK-20236
> URL: https://issues.apache.org/jira/browse/SPARK-20236
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> When we overwrite a partitioned data source table, currently Spark will 
> truncate the entire table to write new data, or truncate a bunch of 
> partitions according to the given static partitions.
> For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions that start with {{a=1}}.
> This behavior is kind of reasonable, as we can know which partitions will be overwritten before runtime. However, Hive has a different behavior: it only overwrites related partitions. For example, {{INSERT OVERWRITE tbl SELECT 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only one data column and is partitioned by {{a}} and {{b}}.
> It seems better if we can follow Hive's behavior.
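A minimal PySpark sketch of the two behaviors described above, using the {{spark.sql.sources.partitionOverwriteMode}} setting (available from 2.3.0, the Fix Version of this ticket). It assumes an existing SparkSession {{spark}} and a table {{tbl}} with one data column, partitioned by {{a}} and {{b}}.

{code:python}
# Default ("static") mode: the whole table / matching static partitions are truncated first.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3")   # truncates the entire table

# Hive-like ("dynamic") mode: only the partitions actually written are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql("INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3")   # only overwrites partition (a=2, b=3)
{code}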






[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-15 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688997#comment-16688997
 ] 

Sean Owen commented on SPARK-26026:
---

Ah, I think it was this:
{code}
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/AlphaComponent.java:33:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface AlphaComponent {}
  ^
/Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/DeveloperApi.java:36:
 warning: Implementation restriction: subclassing Classfile does not
make your annotation visible at runtime.  If that is what
you want, you must write the annotation class in Java.
public @interface DeveloperApi {}
  ^
...
{code}

It may be that we have to port the annotations to make it work.

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.






[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688978#comment-16688978
 ] 

Hyukjin Kwon commented on SPARK-26056:
--

Looks like we should fix this.

> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
> Attachments: sql.jpg
>
>
> When I use the Java API of Spark Streaming to read from Kafka and save Avro (Databricks spark-avro dependency), the SQL tab in the Spark UI repeats again and again.
>  
> The Scala API has no such problem.
>  
> A normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL






[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-15 Thread wish (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688976#comment-16688976
 ] 

wish commented on SPARK-26056:
--

[~hyukjin.kwon] done

> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
> Attachments: sql.jpg
>
>
> When I use the Java API of Spark Streaming to read from Kafka and save Avro (Databricks spark-avro dependency), the SQL tab in the Spark UI repeats again and again.
>  
> The Scala API has no such problem.
>  
> A normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL






[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-15 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688973#comment-16688973
 ] 

Sean Owen commented on SPARK-26026:
---

Hm, I don't recall why I removed that now. It could have been some issue 
generating scaladoc artifacts with 2.12. Let me try re-enabling it to see 
whether there is an issue now or not. While it's not super important to publish 
them as artifacts, I don't think we intended to stop.

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.






[jira] [Commented] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688964#comment-16688964
 ] 

Hyukjin Kwon commented on SPARK-26067:
--

I don't think we should fix this in Spark; it's a Python limitation, and it's fixed in Python 3.7 in any event.

> Pandas GROUPED_MAP udf breaks if DF has >255 columns
> 
>
> Key: SPARK-26067
> URL: https://issues.apache.org/jira/browse/SPARK-26067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Abdeali Kothari
>Priority: Major
>
> When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in Python/pandas on a grouped dataframe in Spark, it fails if the number of columns is greater than 255 on Python 3.6 and lower.
> {code:java}
> import pyspark
> from pyspark.sql import types as T, functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + 
> str(i) for i in range(256)])
> new_schema = T.StructType([
> field for field in df.schema] + [T.StructField("new_row", 
> T.DoubleType())])
> def myfunc(df):
> df['new_row'] = 1
> return df
> myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)
> df2 = df.groupBy(["a1"]).apply(myfunc_udf)
> print(df2.count())  # This FAILS
> # ERROR:
> # Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
> #   File 
> "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 219, in main
> # func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type)
> #   File 
> "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 148, in read_udfs
> # mapper = eval(mapper_str, udfs)
> #   File "", line 1
> # SyntaxError: more than 255 arguments
> {code}
> Note: In Python 3.7 the 255-argument limit was raised, but I have not tried with Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes
> I was using Python 3.5 (from Anaconda) and Spark 2.3.1 to reproduce this on my Hadoop Linux cluster and also on my Mac standalone Spark installation.
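As a plain-Python illustration of the limit behind the {{SyntaxError}} above (independent of Spark): compiling a call expression with more than 255 arguments fails on Python 3.6 and lower and succeeds on 3.7+.

{code:python}
# Build a call expression with 300 arguments and try to compile it.
call_src = "f({})".format(", ".join(str(i) for i in range(300)))

try:
    compile(call_src, "<string>", "eval")
    print("compiled fine (Python 3.7+)")
except SyntaxError as e:
    print(e)  # on Python <= 3.6: more than 255 arguments
{code}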






[jira] [Commented] (SPARK-26064) Unable to fetch jar from remote repo while running spark-submit on kubernetes

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688961#comment-16688961
 ] 

Hyukjin Kwon commented on SPARK-26064:
--

Is it a question or an issue?

> Unable to fetch jar from remote repo while running spark-submit on kubernetes
> -
>
> Key: SPARK-26064
> URL: https://issues.apache.org/jira/browse/SPARK-26064
> Project: Spark
>  Issue Type: Question
>  Components: Kubernetes
>Affects Versions: 2.3.2
>Reporter: Bala Bharath Reddy Resapu
>Priority: Major
>
> I am trying to run Spark on Kubernetes with a Docker image. My requirement is to download the jar from an external repo while running spark-submit. I am able to download the jar using wget in the container, but it doesn't work when passed to the spark-submit command. I am not packaging the jar with the Docker image. It works fine when I put the jar file inside the Docker image.
>  
> ./bin/spark-submit \
> --master k8s://[https://ip:port|https://ipport/] \
> --deploy-mode cluster \
> --name test3 \
> --class hello \
> --conf spark.kubernetes.container.image.pullSecrets=abcd \
> --conf spark.kubernetes.container.image=spark:h2.0 \
> [https://devops.com/artifactory/local/testing/testing_2.11/h|https://bala.bharath.reddy.resapu%40ibm.com:akcp5bcbktykg2ti28sju4gtebsqwkg2mqkaf9w6g5rdbo3iwrwx7qb1m5dokgd54hdru2...@na.artifactory.swg-devops.com/artifactory/txo-cedp-garage-artifacts-sbt-local/testing/testing_2.11/arithmetic.jar]ello.jar






[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688968#comment-16688968
 ] 

Hyukjin Kwon commented on SPARK-26027:
--

Oh, right. Let me keep this in mind and reopen this if it causes an actual problem. It should not be a problem for now.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the Scala version from pom.xml is used to determine which Scala library to fetch, but it doesn't seem to use the value under the scala-2.12 profile even if that is set.
> The result is that the Maven build still uses scala-library 2.11.12 and compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12, but I think that only updates scala.binary.version.)






[jira] [Assigned] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26034:


Assignee: Bryan Cutler  (was: Apache Spark)

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
>







[jira] [Updated] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-15 Thread wish (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wish updated SPARK-26056:
-
Attachment: sql.jpg

> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
> Attachments: sql.jpg
>
>
> When I use the Java API of Spark Streaming to read from Kafka and save Avro (Databricks spark-avro dependency), the SQL tab in the Spark UI repeats again and again.
>  
> The Scala API has no such problem.
>  
> A normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL






[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688967#comment-16688967
 ] 

Hyukjin Kwon commented on SPARK-26077:
--

cc'ing [~maropu] FYI

> Reserved SQL words are not escaped by JDBC writer for table name
> 
>
> Key: SPARK-26077
> URL: https://issues.apache.org/jira/browse/SPARK-26077
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Eugene Golovan
>Priority: Major
>
> This bug is similar to SPARK-16387, but this time the table name is not escaped.
> How to reproduce:
> 1/ Start spark shell with mysql connector
> spark-shell --jars ./mysql-connector-java-8.0.13.jar
>  
> 2/ Execute next code
>  
> import spark.implicits._
> (spark
> .createDataset(Seq("a","b","c"))
> .toDF("order")
> .write
> .format("jdbc")
> .option("url", s"jdbc:mysql://root@localhost:3306/test")
> .option("driver", "com.mysql.cj.jdbc.Driver")
> .option("dbtable", "condition")
> .save)
>  
> Here {{condition}} is a reserved word.
>  
> Error message:
>  
> java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check 
> the manual that corresponds to your MySQL server version for the right syntax 
> to use near 'condition (`order` TEXT )' at line 1
>  at 
> com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
>  at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
>  at 
> com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355)
>  at 
> com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128)
>  at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656)
>  at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267)
>  ... 59 elided
>  
>  
>  
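As a possible workaround (not a fix), quoting the reserved word ourselves in the {{dbtable}} option should produce a valid {{CREATE TABLE}} statement, assuming the JDBC writer passes the value through verbatim. A PySpark sketch with placeholder connection details:

{code:python}
# Backtick-quote the reserved MySQL identifier ourselves in the dbtable option.
(spark.createDataFrame([("a",), ("b",), ("c",)], ["order"])
    .write
    .format("jdbc")
    .option("url", "jdbc:mysql://root@localhost:3306/test")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "`condition`")   # quoted so MySQL accepts the table name
    .save())
{code}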






[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688966#comment-16688966
 ] 

Hyukjin Kwon commented on SPARK-26075:
--

Does this happen in Spark 2.4 as well?

> Cannot broadcast the table that is larger than 8GB : Spark 2.3
> --
>
> Key: SPARK-26075
> URL: https://issues.apache.org/jira/browse/SPARK-26075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Neeraj Bhadani
>Priority: Major
>
>  I am trying to use a broadcast join but am getting the error below in Spark 2.3; however, the same code works fine in Spark 2.2.
>  
> Upon checking, the size of the dataframes is merely 50 MB, and I have set the threshold to 200 MB as well. As I mentioned above, the same code works fine in Spark 2.2.
>  
> {{Error: "Cannot broadcast the table that is larger than 8GB". }}
> However, disabling broadcasting works fine:
> {{'spark.sql.autoBroadcastJoinThreshold': '-1'}}
>  
> {{Regards,}}
> {{Neeraj}}
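For reference, a small PySpark sketch of the knobs being discussed above, with hypothetical dataframes {{small_df}} and {{large_df}}; it only restates the configuration described in the report and does not address the 8GB error itself.

{code:python}
from pyspark.sql.functions import broadcast

# Raise the auto-broadcast threshold to 200 MB, as the reporter did.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(200 * 1024 * 1024))
joined = large_df.join(broadcast(small_df), "id")   # explicit broadcast hint

# Reporter's workaround: disable auto-broadcasting entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}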






[jira] [Resolved] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26067.
--
Resolution: Not A Problem

> Pandas GROUPED_MAP udf breaks if DF has >255 columns
> 
>
> Key: SPARK-26067
> URL: https://issues.apache.org/jira/browse/SPARK-26067
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Abdeali Kothari
>Priority: Major
>
> When I run Spark's Pandas GROUPED_MAP udfs to apply a UDAF I wrote in Python/pandas on a grouped dataframe in Spark, it fails if the number of columns is greater than 255 on Python 3.6 and lower.
> {code:java}
> import pyspark
> from pyspark.sql import types as T, functions as F
> spark = pyspark.sql.SparkSession.builder.getOrCreate()
> df = spark.createDataFrame(
> [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + 
> str(i) for i in range(256)])
> new_schema = T.StructType([
> field for field in df.schema] + [T.StructField("new_row", 
> T.DoubleType())])
> def myfunc(df):
> df['new_row'] = 1
> return df
> myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc)
> df2 = df.groupBy(["a1"]).apply(myfunc_udf)
> print(df2.count())  # This FAILS
> # ERROR:
> # Caused by: org.apache.spark.api.python.PythonException: Traceback (most 
> recent call last):
> #   File 
> "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 219, in main
> # func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type)
> #   File 
> "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line 
> 148, in read_udfs
> # mapper = eval(mapper_str, udfs)
> #   File "", line 1
> # SyntaxError: more than 255 arguments
> {code}
> Note: In Python 3.7 the 255-argument limit was raised, but I have not tried with Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes
> I was using Python 3.5 (from Anaconda) and Spark 2.3.1 to reproduce this on my Hadoop Linux cluster and also on my Mac standalone Spark installation.






[jira] [Resolved] (SPARK-26063) CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26063.
--
Resolution: Duplicate

> CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on 
> unresolved object" when requested for numberedTreeString
> ---
>
> Key: SPARK-26063
> URL: https://issues.apache.org/jira/browse/SPARK-26063
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jacek Laskowski
>Priority: Major
>
> The following gives 
> {{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'id}}:
> {code:java}
> // ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0
> scala> spark.version
> res0: String = 2.4.0
> import org.apache.spark.sql.avro._
> val q = spark.range(1).withColumn("to_avro_id", to_avro('id))
> val logicalPlan = q.queryExecution.logical
> scala> logicalPlan.expressions.drop(1).head.numberedTreeString
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'id
> at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
> at 
> org.apache.spark.sql.avro.CatalystDataToAvro.simpleString(CatalystDataToAvro.scala:56)
> at 
> org.apache.spark.sql.catalyst.expressions.Expression.verboseString(Expression.scala:233)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:569)
> at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472)
> at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:469)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.numberedTreeString(TreeNode.scala:483)
> ... 51 elided{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688962#comment-16688962
 ] 

Apache Spark commented on SPARK-26034:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/23056

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688963#comment-16688963
 ] 

Apache Spark commented on SPARK-26034:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/23056

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Bryan Cutler
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26034) Break large mllib/tests.py files into smaller files

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26034:


Assignee: Apache Spark  (was: Bryan Cutler)

> Break large mllib/tests.py files into smaller files
> ---
>
> Key: SPARK-26034
> URL: https://issues.apache.org/jira/browse/SPARK-26034
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957
 ] 

Jason Moore edited comment on SPARK-26027 at 11/16/18 3:10 AM:
---

I was originally going to withdraw the ticket when I discovered my actual 
issue, and I'm happy for that to happen.  The main concern I was left with was 
that the build scripts download Scala based on the default version (2.11.12 on 
the v2.4.0 tag) rather than taking the profile flag into account.  If you don't 
see this as an issue to worry about, close this ticket and forget all about it.


was (Author: jasonmoore2k):
I was originally going to withdraw the ticket when I discovered my actual 
issue, and happy for that to happen.  The main concern I was left with was that 
the build scripts download based on the default Scala version (2.11.12 on 
v2.4.0 tag) rater than taking the profile flag into account).  If you don't see 
this as an issue to worry about, close this ticket and forget all about it.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the Scala version from pom.xml is used to determine which Scala 
> library to fetch, but it doesn't seem to use the value under the scala-2.12 
> profile even if that is set.
> The result is that the Maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)
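
For reference, a sketch of the Scala 2.12 build steps described in the Spark 2.4-era build documentation (treat the exact flags as an assumption to verify; they can differ between versions):
{noformat}
./dev/change-scala-version.sh 2.12
./build/mvn -Pscala-2.12 -DskipTests clean package
{noformat}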



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26059) Spark standalone mode does not correctly record a failed Spark Job.

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688958#comment-16688958
 ] 

Hyukjin Kwon commented on SPARK-26059:
--

Can you also describe the reproducer and the output (a screenshot if possible)?

> Spark standalone mode does not correctly record a failed Spark Job.
> 
>
> Key: SPARK-26059
> URL: https://issues.apache.org/jira/browse/SPARK-26059
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Spark Core
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> In order to reproduce, submit a failing job to the Spark standalone master. The 
> status for the job is shown as FINISHED, irrespective of whether it 
> failed or succeeded. 
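
A minimal sketch of such a failing job (the object name and exception are illustrative; any application whose tasks throw should reproduce the reported behaviour when submitted with spark-submit against a standalone master):
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical reproducer: every task throws, so the application fails,
// yet the standalone master UI reportedly still shows it as FINISHED.
object AlwaysFails {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("always-fails").getOrCreate()
    try {
      spark.range(10).foreach(_ => throw new RuntimeException("boom"))
    } finally {
      spark.stop()
    }
  }
}
{code}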



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Jason Moore (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957
 ] 

Jason Moore commented on SPARK-26027:
-

I was originally going to withdraw the ticket when I discovered my actual 
issue, and I'm happy for that to happen.  The main concern I was left with was 
that the build scripts download Scala based on the default version (2.11.12 on 
the v2.4.0 tag) rather than taking the profile flag into account.  If you don't 
see this as an issue to worry about, close this ticket and forget all about it.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the Scala version from pom.xml is used to determine which Scala 
> library to fetch, but it doesn't seem to use the value under the scala-2.12 
> profile even if that is set.
> The result is that the Maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688956#comment-16688956
 ] 

Hyukjin Kwon commented on SPARK-26056:
--

Can you upload screenshots?

> java api spark streaming spark-avro ui 
> ---
>
> Key: SPARK-26056
> URL: https://issues.apache.org/jira/browse/SPARK-26056
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming, Web UI
>Affects Versions: 2.3.2
>Reporter: wish
>Priority: Major
>
> When I use the Java API with Spark Streaming to read from Kafka and save Avro 
> (Databricks spark-avro dependency),
> the Spark UI's SQL tabs repeat again and again.
>  
> With the Scala API there is no problem.
>  
> The normal UI looks like this:
>  * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/]
>  * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/]
>  * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/]
>  * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/]
>  * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/]
>  * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/]
>  * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/]
> but the Java API UI looks like this:
> Jobs  Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL 
> SQL  ..SQL



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26050) Implement withColumnExpr method on DataFrame

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688955#comment-16688955
 ] 

Hyukjin Kwon commented on SPARK-26050:
--

This is easy to work around. Currently Spark has too many APIs open; let's 
avoid adding new ones unless they're strongly needed.
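
For reference, a minimal sketch of that workaround, using {{selectExpr}} with SQL aliases over a hypothetical toy DataFrame ({{spark}} is the usual shell session and the column names are illustrative):
{code:scala}
// Toy input with two columns, a and b.
val df = spark.range(3).selectExpr("id AS a", "id * 2 AS b")

// Instead of chaining .withColumn(...).withColumn(...), keep all existing
// columns with "*" and add any number of derived columns in one call.
val df2 = df.selectExpr("*", "a + b AS sum_ab", "a * b AS prod_ab")
df2.show()
{code}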

> Implement withColumnExpr method on DataFrame
> ---
>
> Key: SPARK-26050
> URL: https://issues.apache.org/jira/browse/SPARK-26050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mathew
>Priority: Major
>
> Currently we provide some syntactic sugar in the form of df.selectExpr(), 
> which effectively executes as df.select(expr(), expr(), ...)
> I propose we implement a df.withColumnExpr(), which behaves similarly to 
> df.withColumn(), except without the colName parameter, instead taking column 
> names from the expressions themselves.
> This would stop the unfriendly paradigm of chained 
> .withColumn().withColumn().withColumn() expressions, as we could allow 
> passing as many column expressions as you want.
> Similar to df.selectExpr(), we should support all of: 'column names', 'column 
> expressions', 'column string expressions' as inputs.
> Comments are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26050) Implement withColumnExpr method on DataFrame

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26050.
--
Resolution: Won't Fix

> Implement withColumnExpr method on DataFrame
> ---
>
> Key: SPARK-26050
> URL: https://issues.apache.org/jira/browse/SPARK-26050
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mathew
>Priority: Major
>
> Currently we provide some syntactic sugar in the form of df.selectExpr(), 
> which effectively executes as df.select(expr(), expr(), ...)
> I propose we implement a df.withColumnExpr(), which behaves similarly to 
> df.withColumn(), except without the colName parameter, instead taking column 
> names from the expressions themselves.
> This would stop the unfriendly paradigm of chained 
> .withColumn().withColumn().withColumn() expressions, as we could allow 
> passing as many column expressions as you want.
> Similar to df.selectExpr(), we should support all of: 'column names', 'column 
> expressions', 'column string expressions' as inputs.
> Comments are welcome.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26048:
-
Priority: Major  (was: Blocker)

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Priority: Major
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26048:
-
Target Version/s:   (was: 2.4.1)

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Priority: Major
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688951#comment-16688951
 ] 

Hyukjin Kwon commented on SPARK-26048:
--

Please avoid setting target versions and Critical+ priorities, which are usually 
reserved for committers.

> Flume connector for Spark 2.4 does not exist in Maven repository
> 
>
> Key: SPARK-26048
> URL: https://issues.apache.org/jira/browse/SPARK-26048
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 2.4.0
>Reporter: Aki Tanaka
>Priority: Blocker
>
> Flume connector for Spark 2.4 does not exist in the Maven repository.
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume]
>  
> [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink]
> These packages will be removed in Spark 3. But Spark 2.4 branch still has 
> these packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26031) dataframe can't load correct after saving to local disk in cluster mode

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26031.
--
Resolution: Invalid

> dataframe can't load correct after saving to local disk in cluster mode
> ---
>
> Key: SPARK-26031
> URL: https://issues.apache.org/jira/browse/SPARK-26031
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: 1 spark master
> 3 spark slaves
>  
>Reporter: Bihui Jin
>Priority: Major
>
> Firstly I saved a spark dataframe to local disk in spark cluster mode with "
> df.write \
> .format('json') \
> .save('file:///root/bughunter/', mode='overwrite')
> " (using interface provide by {color:#FF}pyspark{color})
> Then I load it with "
> spark.read.format('json').load('file:///root/bughunter/')
> "
> But it failed with " org.apache.spark.sql.AnalysisException: Unable to infer 
> schema for JSON. It must be specified manually."
> And I check every node's disk:
> In master:
> only the file named "_SUCCESS" exists in /root/bughunter/;
> In each slave, there is a folder named "_temporary" exists in /root/bughunter/
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26047) Py4JNetworkError (on IPV6): An error occurred while trying to connect to the Java server (127.0.0.1

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688950#comment-16688950
 ] 

Hyukjin Kwon commented on SPARK-26047:
--

Thanks for working around it. I assume this is a Py4J issue rather than Spark's?

> Py4JNetworkError (on IPV6): An error occurred while trying to connect to the 
> Java server (127.0.0.1
> ---
>
> Key: SPARK-26047
> URL: https://issues.apache.org/jira/browse/SPARK-26047
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Abdul Mateen Mohammed
>Priority: Major
>
> On IPV6, I got the following error when pyspark is invoked:
> h1. Py4JNetworkError: An error occurred while trying to connect to the Java 
> server (127.0.0.1...)
> Whereas on IPv4, it works fine.
> I realized that the issue was due to the default address specified as 127.0.0.1 
> in java_gateway.py under py4j-0.10.7-src.zip
> Resolution:
> I was able to fix it by replacing the entry 
> DEFAULT_ADDRESS = "127.0.0.1" with DEFAULT_ADDRESS = "::1"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 dependency

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26045:
-
Target Version/s:   (was: 2.4.0)

> Error in the spark 2.4 release package with the spark-avro_2.11 dependency
> 
>
> Key: SPARK-26045
> URL: https://issues.apache.org/jira/browse/SPARK-26045
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
> Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC 
> 2018 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Oscar garcía 
>Priority: Major
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Hello, I have been having problems with the latest Spark 2.4 release: the Avro 
> file read feature does not seem to be working. I have fixed it locally by building 
> the source code and updating the *avro-1.8.2.jar* in the *$SPARK_HOME*/jars/ 
> dependencies.
> With the default Spark 2.4 release, when I try to read an Avro file Spark 
> raises the following exception.  
> {code:java}
> spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0
> scala> spark.read.format("avro").load("file.avro")
> java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105
> {code}
> Checksum:  spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 
> 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 
> 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26041.
--
Resolution: Cannot Reproduce

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downflow queries). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>  at 

[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688947#comment-16688947
 ] 

Hyukjin Kwon commented on SPARK-26041:
--

[~Tagar] no, don't request investigation here. Please narrow down and describe 
the details. No one can reproduce it for now except you. I am leaving this 
resolved until we get the proper information for this issue.

> catalyst cuts out some columns from dataframes: 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute
> -
>
> Key: SPARK-26041
> URL: https://issues.apache.org/jira/browse/SPARK-26041
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, Spark Core
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
> Environment: Spark 2.3.2 
> Hadoop 2.6
> When we materialize one of the intermediate dataframes as a parquet table and 
> read it back in, this error doesn't happen (exact same downflow queries). 
>  
>Reporter: Ruslan Dautkhanov
>Priority: Major
>  Labels: catalyst, optimization
> Attachments: SPARK-26041.txt
>
>
> There is a workflow with a number of group-by's, joins, `exists` and `in`s 
> between a set of dataframes. 
> We are getting the following exception, and the reason is that Catalyst cuts some 
> columns out of the dataframes: 
> {noformat}
> Unhandled error: , An error occurred 
> while calling o1187.cache.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 
> in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage 
> 2011.0 (TID 832340, pc1udatahad23, execut
> or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> Binding attribute, tree: part_code#56012
>  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318)
>  at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
>  at 
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46)
>  at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186)
>  at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210)
>  at 
> 

[jira] [Resolved] (SPARK-26040) CSV Row delimiters not consistent between platforms

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26040.
--
Resolution: Duplicate

> CSV Row delimiters not consistent between platforms
> ---
>
> Key: SPARK-26040
> URL: https://issues.apache.org/jira/browse/SPARK-26040
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Heath Abelson
>Priority: Major
>
> Running a spark job on *nix platforms, only unix style row delimiters (\n) 
> are recognized. When running the job on windows, only windows style 
> delimiters are recognized (\r\n).
> The result is that, when trying to read a CSV generated by MS Excel with Spark 
> running on Linux, extra characters are included in field names and field 
> values that are last on the line.
> Ideally, the CSV parser would be able to handle the 2 different flavors of 
> line endings regardless of what platform the job is being run on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26040) CSV Row delimiters not consistent between platforms

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688943#comment-16688943
 ] 

Hyukjin Kwon commented on SPARK-26040:
--

I think this is not an issue when {{multiLine}} is disabled, because we delegate 
newline handling to the Hadoop library, which deals with both cases.
The problem is when {{multiLine}} is enabled. That case is fixed in 
https://github.com/apache/spark/pull/22503

This should be a duplicate of SPARK-25493.
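
A minimal sketch of the {{multiLine}} code path in question (the file path is hypothetical; with {{multiLine}} disabled, line splitting is delegated to Hadoop and both \n and \r\n already work):
{code:scala}
// Only this multiLine=true path needed the CRLF handling fixed in the PR above.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/path/to/excel_export.csv")
{code}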

> CSV Row delimiters not consistent between platforms
> ---
>
> Key: SPARK-26040
> URL: https://issues.apache.org/jira/browse/SPARK-26040
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.3.0
>Reporter: Heath Abelson
>Priority: Major
>
> Running a spark job on *nix platforms, only unix style row delimiters (\n) 
> are recognized. When running the job on windows, only windows style 
> delimiters are recognized (\r\n).
> The result is that, when trying to read a CSV generated by MS Excel with Spark 
> running on Linux, extra characters are included in field names and field 
> values that are last on the line.
> Ideally, the CSV parser would be able to handle the 2 different flavors of 
> line endings regardless of what platform the job is being run on.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688934#comment-16688934
 ] 

Hyukjin Kwon commented on SPARK-26019:
--

Are you able to make a simple reproducer? If it's about flakiness, we should be 
able to reproduce it when it's executed multiple times.

> pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" 
> in authenticate_and_accum_updates()
> 
>
> Key: SPARK-26019
> URL: https://issues.apache.org/jira/browse/SPARK-26019
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> Started happening after 2.3.1 -> 2.3.2 upgrade.
>  
> {code:python}
> Exception happened during processing of request from ('127.0.0.1', 43418)
> 
> Traceback (most recent call last):
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 290, in _handle_request_noblock
>     self.process_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 318, in process_request
>     self.finish_request(request, client_address)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 331, in finish_request
>     self.RequestHandlerClass(request, client_address, self)
>   File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 
> 652, in __init__
>     self.handle()
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 263, in handle
>     poll(authenticate_and_accum_updates)
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 238, in poll
>     if func():
>   File 
> "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py",
>  line 251, in authenticate_and_accum_updates
>     received_token = self.rfile.read(len(auth_token))
> TypeError: object of type 'NoneType' has no len()
>  
> {code}
>  
> Error happens here:
> https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
> The PySpark code was just running a simple pipeline of 
> binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
> and then converting it to a dataframe and running a count on it.
> It seems the error is flaky - on the next rerun it didn't happen.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26016.
--
Resolution: Invalid

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function it looks like the 
> encoding is not correct.  In addition a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26039) Reading an empty folder as ORC causes an Analysis Exception

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688941#comment-16688941
 ] 

Hyukjin Kwon commented on SPARK-26039:
--

Does this happen in Spark 2.4 as well?

> Reading an empty folder as ORC causes an Analysis Exception
> ---
>
> Key: SPARK-26039
> URL: https://issues.apache.org/jira/browse/SPARK-26039
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Abhishek Verma
>Priority: Minor
>
> Reading an empty folder as ORC:
> {code:scala}
> val df = spark.read.format("orc").load(orcEmptyFolderPath)
> {code}
> fails with:
> {noformat}
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   ... 49 elided
> {noformat}
> The current workaround is to catch the exception:
> {code:scala}
> try {
>   spark.read.format("orc").load(path)
> } catch {
>   case ex: org.apache.spark.sql.AnalysisException => null
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26031) dataframe can't load correct after saving to local disk in cluster mode

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688939#comment-16688939
 ] 

Hyukjin Kwon commented on SPARK-26031:
--

That's because you're using {{file://...}} in a cluster. The file system should 
usually be a distributed file system that all nodes can access.
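
A minimal sketch of what generally works instead, shown in Scala (the PySpark calls are analogous) and assuming a shared filesystem such as HDFS is available; the path below is hypothetical:
{code:scala}
// Write to a location that the driver and every executor can reach,
// e.g. HDFS, rather than file:// on each node's local disk.
df.write
  .format("json")
  .mode("overwrite")
  .save("hdfs:///tmp/bughunter")

val reloaded = spark.read.format("json").load("hdfs:///tmp/bughunter")
{code}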

> dataframe can't load correct after saving to local disk in cluster mode
> ---
>
> Key: SPARK-26031
> URL: https://issues.apache.org/jira/browse/SPARK-26031
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
> Environment: 1 spark master
> 3 spark slaves
>  
>Reporter: Bihui Jin
>Priority: Major
>
> Firstly I saved a spark dataframe to local disk in spark cluster mode with "
> df.write \
> .format('json') \
> .save('file:///root/bughunter/', mode='overwrite')
> " (using interface provide by {color:#FF}pyspark{color})
> Then I load it with "
> spark.read.format('json').load('file:///root/bughunter/')
> "
> But it failed with " org.apache.spark.sql.AnalysisException: Unable to infer 
> schema for JSON. It must be specified manually."
> And I check every node's disk:
> In master:
> only the file named "_SUCCESS" exists in /root/bughunter/;
> In each slave, there is a folder named "_temporary" exists in /root/bughunter/
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688938#comment-16688938
 ] 

Hyukjin Kwon commented on SPARK-26027:
--

Is the goal to use Scala 2.12.6? The default has now been changed to 2.12.6 as of 
https://github.com/apache/spark/commit/ad853c56788fd32e035369d1fe3d96aaf6c4ef16.
 This issue is obsolete, so we had better leave it resolved.

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the Scala version from pom.xml is used to determine which Scala 
> library to fetch, but it doesn't seem to use the value under the scala-2.12 
> profile even if that is set.
> The result is that the Maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26027.
--
Resolution: Not A Problem

> Unable to build Spark for Scala 2.12 with Maven script provided
> ---
>
> Key: SPARK-26027
> URL: https://issues.apache.org/jira/browse/SPARK-26027
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Jason Moore
>Priority: Minor
>
> In ./build/mvn, the Scala version from pom.xml is used to determine which Scala 
> library to fetch, but it doesn't seem to use the value under the scala-2.12 
> profile even if that is set.
> The result is that the Maven build still uses scala-library 2.11.12 and 
> compilation fails.
> Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think 
> that only updates scala.binary.version)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688937#comment-16688937
 ] 

Hyukjin Kwon commented on SPARK-26026:
--

cc [~srowen]

> Published Scaladoc jars missing from Maven Central
> --
>
> Key: SPARK-26026
> URL: https://issues.apache.org/jira/browse/SPARK-26026
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0
>Reporter: Long Cao
>Priority: Minor
>
> For 2.3.x and beyond, it appears that published *-javadoc.jars are missing.
> For concrete examples:
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
>  * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/]
>  * 
> [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/]
> After some searching, I'm venturing a guess that [this 
> commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033]
>  removed packaging Scaladoc with the rest of the distribution.
> I don't think it's a huge problem since the versioned Scaladocs are hosted on 
> apache.org, but I use an external documentation/search tool 
> ([Dash|https://kapeli.com/dash]) that operates by looking up published 
> javadoc jars and it'd be nice to have these available.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688932#comment-16688932
 ] 

Hyukjin Kwon commented on SPARK-26016:
--

Let's avoid asking for investigation in JIRA. It sounds more appropriate to ask on 
the mailing list. Let's discuss this on the mailing list first and file a bug here 
once we're clear that it's a bug.
Let me leave this resolved.

> Encoding not working when using a map / mapPartitions call
> --
>
> Key: SPARK-26016
> URL: https://issues.apache.org/jira/browse/SPARK-26016
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.0
>Reporter: Chris Caspanello
>Priority: Major
> Attachments: spark-sandbox.zip
>
>
> Attached you will find a project with unit tests showing the issue at hand.
> If I read in an ISO-8859-1 encoded file and simply write out what was read, 
> the contents of the part file match what was read, which is great.
> However, the second I use a map / mapPartitions function it looks like the 
> encoding is not correct.  In addition a simple collectAsList and writing that 
> list of strings to a file does not work either.  I don't think I'm doing 
> anything wrong.  Can someone please investigate?  I think this is a bug.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25992) Accumulators giving KeyError in pyspark

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688931#comment-16688931
 ] 

Hyukjin Kwon commented on SPARK-25992:
--

It sounds unclear whether this is an issue within Spark or not. Would you be 
interested in continuing the investigation?

> Accumulators giving KeyError in pyspark
> ---
>
> Key: SPARK-25992
> URL: https://issues.apache.org/jira/browse/SPARK-25992
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.1
>Reporter: Abdeali Kothari
>Priority: Major
>
> I am using accumulators, and when I run my code I sometimes get some warning 
> messages. When I checked, there was nothing accumulated - I'm not sure if I lost 
> info from the accumulator, or whether it worked and I can ignore this error.
> The message:
> {noformat}
> Exception happened during processing of request from
> ('127.0.0.1', 62099)
> Traceback (most recent call last):
> File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in 
> _handle_request_noblock
> self.process_request(request, client_address)
> File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in 
> process_request
> self.finish_request(request, client_address)
> File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in 
> finish_request
> self.RequestHandlerClass(request, client_address, self)
> File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in 
> __init__
> self.handle()
> File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, 
> in handle
> _accumulatorRegistry[aid] += update
> KeyError: 0
> 
> 2018-11-09 19:09:08 ERROR DAGScheduler:91 - Failed to update accumulators for 
> task 0
> org.apache.spark.SparkException: EOF reached before Python server acknowledged
>   at 
> org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:634)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1131)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1123)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1123)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1206)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26080:
-
Target Version/s: 2.4.1, 3.0.0  (was: 3.0.0)

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26080:


Assignee: (was: Apache Spark)

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26080:


Assignee: Apache Spark

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Assignee: Apache Spark
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688904#comment-16688904
 ] 

Apache Spark commented on SPARK-26080:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23055

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Blocker
>
> Use of the resource module in python means worker.py cannot run on a windows 
> system. This package is only available in unix based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688900#comment-16688900
 ] 

Apache Spark commented on SPARK-26085:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23054

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> When doing typed aggregation on a Dataset, the key attribute is named "key" for 
> a complex key type, but for a primitive key type it is named "value". The key 
> attribute should be named "key" for primitive types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688902#comment-16688902
 ] 

Apache Spark commented on SPARK-26085:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/23054

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> When doing typed aggregation on a Dataset, the key attribute is named "key" for 
> a complex key type, but for a primitive key type it is named "value". The key 
> attribute should be named "key" for primitive types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26085:


Assignee: Apache Spark

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> When doing typed aggregation on a Dataset, the key attribute is named "key" for 
> a complex key type, but for a primitive key type it is named "value". The key 
> attribute should be named "key" for primitive types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26085:


Assignee: (was: Apache Spark)

> Key attribute of primitive type under typed aggregation should be named as 
> "key" too
> 
>
> Key: SPARK-26085
> URL: https://issues.apache.org/jira/browse/SPARK-26085
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> When doing typed aggregation on a Dataset, the key attribute is named "key" for 
> a complex key type, but for a primitive key type it is named "value". The key 
> attribute should be named "key" for primitive types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too

2018-11-15 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-26085:
---

 Summary: Key attribute of primitive type under typed aggregation 
should be named as "key" too
 Key: SPARK-26085
 URL: https://issues.apache.org/jira/browse/SPARK-26085
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


When doing typed aggregation on a Dataset, the key attribute is named "key" for a 
complex key type, but for a primitive key type it is named "value". The key 
attribute should be named "key" for primitive types as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26080:
-
Priority: Blocker  (was: Major)

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26080:
-
Target Version/s: 3.0.0

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Blocker
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1665#comment-1665
 ] 

Hyukjin Kwon commented on SPARK-26080:
--

We should fix this.

> Unable to run worker.py on Windows
> --
>
> Key: SPARK-26080
> URL: https://issues.apache.org/jira/browse/SPARK-26080
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
> Environment: Windows 10 Education 64 bit
>Reporter: Hayden Jeune
>Priority: Major
>
> Use of the resource module in Python means worker.py cannot run on a Windows 
> system. This package is only available in Unix-based environments.
> [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]
> {code:python}
> textFile = sc.textFile("README.md")
> textFile.first()
> {code}
> When the above commands are run I receive the error 'worker failed to connect 
> back', and I can see an exception in the console coming from worker.py saying 
> 'ModuleNotFoundError: No module named resource'
> I do not really know enough about what I'm doing to fix this myself. 
> Apologies if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-15 Thread Nagaram Prasad Addepally (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688869#comment-16688869
 ] 

Nagaram Prasad Addepally commented on SPARK-25957:
--

Posted PR [https://github.com/apache/spark/pull/23053] for this ticket.

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build the 
> Spark distribution with R support, so it would be good to skip building and 
> publishing spark-r images when R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688871#comment-16688871
 ] 

Apache Spark commented on SPARK-25957:
--

User 'ramaddepally' has created a pull request for this issue:
https://github.com/apache/spark/pull/23053

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build the 
> Spark distribution with R support, so it would be good to skip building and 
> publishing spark-r images when R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688870#comment-16688870
 ] 

Apache Spark commented on SPARK-25957:
--

User 'ramaddepally' has created a pull request for this issue:
https://github.com/apache/spark/pull/23053

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build the 
> Spark distribution with R support, so it would be good to skip building and 
> publishing spark-r images when R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25957:


Assignee: Apache Spark

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Assignee: Apache Spark
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build the 
> Spark distribution with R support, so it would be good to skip building and 
> publishing spark-r images when R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25957:


Assignee: (was: Apache Spark)

> Skip building spark-r docker image if spark distribution does not have R 
> support
> 
>
> Key: SPARK-25957
> URL: https://issues.apache.org/jira/browse/SPARK-25957
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Nagaram Prasad Addepally
>Priority: Major
>
> [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh]
>  script tries to build the spark-r image by default. We may not always build the 
> Spark distribution with R support, so it would be good to skip building and 
> publishing spark-r images when R support is not available in the Spark 
> distribution.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25883) Override method `prettyName` in `from_avro`/`to_avro`

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-25883:
-
Fix Version/s: 2.4.1

> Override method `prettyName` in `from_avro`/`to_avro`
> -
>
> Key: SPARK-25883
> URL: https://issues.apache.org/jira/browse/SPARK-25883
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.4.1, 3.0.0
>
>
> Previously, in from_avro/to_avro, we overrode the methods `simpleString` and 
> `sql` for the string output. However, the override only affects the alias 
> naming:
> ```
> Project [from_avro('col, 
> ...
> , (mode,PERMISSIVE)) AS from_avro(col, struct, 
> Map(mode -> PERMISSIVE))#11]
> ```
> It only makes the alias name quite long.
> We should follow `from_csv`/`from_json` here and override only the method 
> `prettyName`, so that we get a clean alias name:
> ```
> ... AS from_avro(col)#11
> ```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26035.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23034
[https://github.com/apache/spark/pull/23034]

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26035) Break large streaming/tests.py files into smaller files

2018-11-15 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26035:


Assignee: Hyukjin Kwon

> Break large streaming/tests.py files into smaller files
> ---
>
> Key: SPARK-26035
> URL: https://issues.apache.org/jira/browse/SPARK-26035
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams, PySpark
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-15 Thread Simeon Simeonov (JIRA)
Simeon Simeonov created SPARK-26084:
---

 Summary: AggregateExpression.references fails on unresolved 
expression trees
 Key: SPARK-26084
 URL: https://issues.apache.org/jira/browse/SPARK-26084
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Simeon Simeonov


[SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
stable ordering in {{AttributeSet.toSeq}} using expression IDs 
([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
 without noticing that {{AggregateExpression.references}} used 
{{AttributeSet.toSeq}} as a shortcut 
([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
 The net result is that {{AggregateExpression.references}} fails for unresolved 
aggregate functions.

{code:scala}
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
  org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
  mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
  isDistinct = false
).references
{code}

fails with

{code:scala}
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
exprId on unresolved object, tree: 'y
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
at java.util.TimSort.sort(TimSort.java:220)
at java.util.Arrays.sort(Arrays.java:1438)
at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
at scala.collection.AbstractSeq.sorted(Seq.scala:41)
at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
at 
org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
at 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
{code}

The solution is to avoid calling {{toSeq}}, since ordering is not important in 
{{references}}, and to simplify (and speed up) the implementation to something like:

{code:scala}
mode match {
  case Partial | Complete => aggregateFunction.references
  case PartialMerge | Final => 
AttributeSet(aggregateFunction.aggBufferAttributes)
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees

2018-11-15 Thread Simeon Simeonov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688760#comment-16688760
 ] 

Simeon Simeonov commented on SPARK-26084:
-

/cc [~maropu] [~hvanhovell] who worked on the PR that may have caused this 
problem

> AggregateExpression.references fails on unresolved expression trees
> ---
>
> Key: SPARK-26084
> URL: https://issues.apache.org/jira/browse/SPARK-26084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: aggregate, regression, sql
>
> [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a 
> stable ordering in {{AttributeSet.toSeq}} using expression IDs 
> ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128])
>  without noticing that {{AggregateExpression.references}} used 
> {{AttributeSet.toSeq}} as a shortcut 
> ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]).
>  The net result is that {{AggregateExpression.references}} fails for 
> unresolved aggregate functions.
> {code:scala}
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression(
>   org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr),
>   mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete,
>   isDistinct = false
> ).references
> {code}
> fails with
> {code:scala}
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> exprId on unresolved object, tree: 'y
>   at 
> org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128)
>   at scala.math.Ordering$$anon$5.compare(Ordering.scala:122)
>   at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>   at java.util.TimSort.sort(TimSort.java:220)
>   at java.util.Arrays.sort(Arrays.java:1438)
>   at scala.collection.SeqLike$class.sorted(SeqLike.scala:648)
>   at scala.collection.AbstractSeq.sorted(Seq.scala:41)
>   at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623)
>   at scala.collection.AbstractSeq.sortBy(Seq.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201)
> {code}
> The solution is to avoid calling {{toSeq}}, since ordering is not important in 
> {{references}}, and to simplify (and speed up) the implementation to something 
> like:
> {code:scala}
> mode match {
>   case Partial | Complete => aggregateFunction.references
>   case PartialMerge | Final => 
> AttributeSet(aggregateFunction.aggBufferAttributes)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688743#comment-16688743
 ] 

Apache Spark commented on SPARK-26083:
--

User 'AzureQ' has created a pull request for this issue:
https://github.com/apache/spark/pull/23037

> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. 
> Could not open PYTHONSTARTUP 
> IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688741#comment-16688741
 ] 

Apache Spark commented on SPARK-26083:
--

User 'AzureQ' has created a pull request for this issue:
https://github.com/apache/spark/pull/23037

> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. 
> Could not open PYTHONSTARTUP 
> IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26083:


Assignee: Apache Spark

> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Assignee: Apache Spark
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. 
> Could not open PYTHONSTARTUP 
> IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26083:


Assignee: (was: Apache Spark)

> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. 
> Could not open PYTHONSTARTUP 
> IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Qi Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Shao updated SPARK-26083:

Description: 
When I try to run
{code:java}
./bin/pyspark{code}
in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), I'm 
getting an error:
{code:java}
$SPARK_HOME/bin/pyspark --deploy-mode client --master 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
"help", "copyright", "credits" or "license" for more information. 
Could not open PYTHONSTARTUP 
IOError: [Errno 2] No such file or directory: 
'/opt/spark/python/pyspark/shell.py'{code}
This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.

  was:
When I try to run
{code:java}
./bin/pyspark{code}
in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm 
getting an error:
{code:java}
$SPARK_HOME/bin/pyspark --deploy-mode client --master 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
"help", "copyright", "credits" or "license" for more information. Could not 
open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: 
'/opt/spark/python/pyspark/shell.py'{code}
This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}


> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. 
> Could not open PYTHONSTARTUP 
> IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Qi Shao (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qi Shao updated SPARK-26083:

Description: 
When I try to run
{code:java}
./bin/pyspark{code}
in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm 
getting an error:
{code:java}
$SPARK_HOME/bin/pyspark --deploy-mode client --master 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
"help", "copyright", "credits" or "license" for more information. Could not 
open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: 
'/opt/spark/python/pyspark/shell.py'{code}
This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}

  was:
When I try to run {{}}
{code:java}
./bin/pyspark{code}
{{}}in a pod in Kubernetes(image built without change from pyspark Dockerfile), 
I'm getting an error:
{code:java}
$SPARK_HOME/bin/pyspark --deploy-mode client --master 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
"help", "copyright", "credits" or "license" for more information. Could not 
open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: 
'/opt/spark/python/pyspark/shell.py'{code}
This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}


> Pyspark command is not working properly with default Docker Image build
> ---
>
> Key: SPARK-26083
> URL: https://issues.apache.org/jira/browse/SPARK-26083
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Qi Shao
>Priority: Minor
>  Labels: easyfix, newbie, patch, pull-request-available
> Fix For: 2.4.1
>
>
> When I try to run
> {code:java}
> ./bin/pyspark{code}
> in a pod in Kubernetes (image built without changes from the pyspark Dockerfile), 
> I'm getting an error:
> {code:java}
> $SPARK_HOME/bin/pyspark --deploy-mode client --master 
> k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
> Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
> "help", "copyright", "credits" or "license" for more information. Could not 
> open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: 
> '/opt/spark/python/pyspark/shell.py'{code}
> This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26083) Pyspark command is not working properly with default Docker Image build

2018-11-15 Thread Qi Shao (JIRA)
Qi Shao created SPARK-26083:
---

 Summary: Pyspark command is not working properly with default 
Docker Image build
 Key: SPARK-26083
 URL: https://issues.apache.org/jira/browse/SPARK-26083
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 2.4.0
Reporter: Qi Shao
 Fix For: 2.4.1


When I try to run {{}}
{code:java}
./bin/pyspark{code}
{{}}in a pod in Kubernetes(image built without change from pyspark Dockerfile), 
I'm getting an error:
{code:java}
$SPARK_HOME/bin/pyspark --deploy-mode client --master 
k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... 
Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type 
"help", "copyright", "credits" or "license" for more information. Could not 
open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: 
'/opt/spark/python/pyspark/shell.py'{code}
This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping

2018-11-15 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25035:
-
Labels: memory-analysis  (was: )

> Replicating disk-stored blocks should avoid memory mapping
> --
>
> Key: SPARK-25035
> URL: https://issues.apache.org/jira/browse/SPARK-25035
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.1
>Reporter: Imran Rashid
>Priority: Major
>  Labels: memory-analysis
>
> This is a follow-up to SPARK-24296.
> When replicating a disk-cached block, even if we fetch-to-disk, we still 
> memory-map the file, just to copy it to another location.
> Ideally we'd just move the tmp file to the right location.  But even without 
> that, we could read the file as an input stream, instead of memory-mapping 
> the whole thing.  Memory-mapping is a particular problem when running under 
> YARN, as the OS may believe there is plenty of memory available even while 
> YARN decides to kill the process for exceeding memory limits.
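
For illustration, a rough sketch of the difference between memory-mapping the block file and streaming it through a bounded buffer, as suggested above ("src" and "dst" are placeholder paths, not Spark's actual block files):

{code:scala}
import java.io.{FileInputStream, FileOutputStream}
import java.nio.channels.FileChannel

// Memory-mapped copy: maps the entire file into the address space before writing.
def copyByMapping(src: String, dst: String): Unit = {
  val in  = new FileInputStream(src).getChannel
  val out = new FileOutputStream(dst).getChannel
  try {
    val mapped = in.map(FileChannel.MapMode.READ_ONLY, 0, in.size())
    while (mapped.hasRemaining) out.write(mapped)
  } finally { in.close(); out.close() }
}

// Streaming copy: memory use is bounded by the buffer, regardless of block size.
def copyByStreaming(src: String, dst: String): Unit = {
  val in  = new FileInputStream(src)
  val out = new FileOutputStream(dst)
  val buf = new Array[Byte](64 * 1024)
  try {
    var n = in.read(buf)
    while (n != -1) { out.write(buf, 0, n); n = in.read(buf) }
  } finally { in.close(); out.close() }
}
{code}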



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2018-11-15 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-26082:

Description: 
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least the Spark binaries, a 
custom jar, and additional dependencies) adds considerable overhead in network 
traffic and startup time when frequently running Spark applications on a Mesos 
cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are 
also copied and left in the sandbox when the cache is off (rather than being 
extracted directly without an extra copy), this can considerably increase disk 
usage. Users CAN currently work around this by specifying the 
{{spark.mesos.fetchCache.enable}} option, but this should at least be mentioned 
in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update {{MesosClusterScheduler.scala}} to use 
{{spark.mesos.fetcherCache.enable}} going forward (literally a one-line change).
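
For illustration only (not the actual patch), a minimal sketch of the one-line direction suggested above, with a hypothetical fallback to the legacy spelling so existing jobs keep working; it assumes a {{SparkConf}} in scope, as in {{MesosClusterScheduler}}:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical sketch: prefer the documented key, falling back to the legacy
// "fetchCache" spelling that the cluster scheduler currently reads.
def useFetcherCache(conf: SparkConf): Boolean =
  conf.getBoolean("spark.mesos.fetcherCache.enable",
    conf.getBoolean("spark.mesos.fetchCache.enable", false))
{code}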

  was:
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least spark binaries, 
custom jar, and additional dependencies) adds considerable overhead network 
traffic and startup time when frequently running spark Applications on a Mesos 
cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are 
additionally copied and left in the sandbox with the cache off (rather than 
extracted directly without an extra copy), this can considerably increase disk 
usage. Users CAN currently workaround by specifying the 
{{spark.mesos.fetchCache.enable}} option, but this should at least be specified 
in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward (literally 
a one-line change).


> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to 

[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2018-11-15 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-26082:

Description: 
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least spark binaries, 
custom jar, and additional dependencies) adds considerable overhead network 
traffic and startup time when frequently running spark Applications on a Mesos 
cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are 
additionally copied and left in the sandbox with the cache off (rather than 
extracted directly without an extra copy), this can considerably increase disk 
usage. Users CAN currently workaround by specifying the 
{{spark.mesos.fetchCache.enable}} option, but this should at least be specified 
in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward (literally 
a one-line change).

  was:
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least spark binaries, 
custom jar, and additional dependencies) adds considerable overhead network 
traffic and startup time when frequently running spark Applications on a Mesos 
cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are 
additionally copied and left in the sandbox with the cache off (rather than 
extracted directly without an extra copy), this can considerably increase disk 
usage. Users CAN currently workaround by specifying the 
{{spark.mesos.fetchCache.enable}} option, but this should at least be specified 
in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward.


> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> 

[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2018-11-15 Thread Martin Loncaric (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Martin Loncaric updated SPARK-26082:

Description: 
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least spark binaries, 
custom jar, and additional dependencies) adds considerable overhead network 
traffic and startup time when frequently running spark Applications on a Mesos 
cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are 
additionally copied and left in the sandbox with the cache off (rather than 
extracted directly without an extra copy), this can considerably increase disk 
usage. Users CAN currently workaround by specifying the 
{{spark.mesos.fetchCache.enable}} option, but this should at least be specified 
in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward.

  was:
Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least spark binaries, 
custom jar, and additional dependencies) adds considerable network traffic when 
frequently running spark Applications on a Mesos cluster. Additionally, since 
extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left 
in the sandbox with the cache off (rather than extracted directly without an 
extra copy), this can considerably increase disk usage. Users CAN currently 
workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this 
should at least be specified in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 
2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward.


> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
> ---
>
> Key: SPARK-26082
> URL: https://issues.apache.org/jira/browse/SPARK-26082
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 
> 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2
>Reporter: Martin Loncaric
>Priority: Major
>
> Currently in 
> [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
> {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
> (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the 
> Mesos Fetcher Cache
> {quote}
> Currently in {{MesosClusterScheduler.scala}} (which passes parameter to 
> driver):
> {{private val useFetchCache = 
> conf.getBoolean("spark.mesos.fetchCache.enable", false)}}
> Currently in {{MesosCoarseGrainedSchedulerBackend.scala}} (which passes the Mesos 
> caching parameter to executors):
> {{private val useFetcherCache = 
> conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}
> This naming 

[jira] [Created] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler

2018-11-15 Thread Martin Loncaric (JIRA)
Martin Loncaric created SPARK-26082:
---

 Summary: Misnaming of spark.mesos.fetch(er)Cache.enable in 
MesosClusterScheduler
 Key: SPARK-26082
 URL: https://issues.apache.org/jira/browse/SPARK-26082
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0, 2.1.3, 2.1.2, 
2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
Reporter: Martin Loncaric


Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]:
{quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs 
(example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos 
Fetcher Cache
{quote}

Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver):
{{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", 
false)}}

Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos 
caching parameter to executors):
{{private val useFetcherCache = 
conf.getBoolean("spark.mesos.fetcherCache.enable", false)}}

This naming discrepancy dates back to version 2.0.0 
([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]).

This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the 
Mesos cache will be used only for executors, and not for drivers.

IMPACT:
Not caching these driver files (typically including at least the Spark binaries, a 
custom jar, and additional dependencies) adds considerable network traffic when 
frequently running Spark applications on a Mesos cluster. Additionally, since 
archives like {{spark-x.x.x-bin-*.tgz}} are copied into the sandbox and left there 
when the cache is off (rather than extracted directly without an extra copy), this 
can considerably increase disk usage. Users can currently work around this by 
specifying the {{spark.mesos.fetchCache.enable}} option, but it should at least be 
mentioned in the documentation.

SUGGESTED FIX:
Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2.0 
through 2.4, and standardize on {{spark.mesos.fetcherCache.enable}} going forward.
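
As a rough illustration of the direction such a fix could take (a sketch only, not 
the actual Spark patch; the object and method names below are assumptions), both 
schedulers could read the documented key while still honoring the legacy one:

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical compatibility shim: prefer the documented key,
// falling back to the legacy key read by MesosClusterScheduler today.
object FetcherCacheConf {
  def useFetcherCache(conf: SparkConf): Boolean = {
    val legacy = conf.getBoolean("spark.mesos.fetchCache.enable", defaultValue = false)
    conf.getBoolean("spark.mesos.fetcherCache.enable", defaultValue = legacy)
  }
}
{code}

With a shim like this, existing jobs that set either spelling would keep their 
current behavior while the documentation converges on a single name.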



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26081) Do not write empty files by text datasources

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26081:


Assignee: (was: Apache Spark)

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Text-based datasources like CSV, JSON and Text produce empty files for empty 
> partitions. This introduces additional overhead when opening and reading 
> such files back. In the current implementation of OutputWriter, the output 
> stream is created eagerly even when no records are written to it, so its 
> creation can be postponed until the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26081) Do not write empty files by text datasources

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26081:


Assignee: Apache Spark

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Text-based datasources like CSV, JSON and Text produce empty files for empty 
> partitions. This introduces additional overhead when opening and reading 
> such files back. In the current implementation of OutputWriter, the output 
> stream is created eagerly even when no records are written to it, so its 
> creation can be postponed until the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26081) Do not write empty files by text datasources

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688664#comment-16688664
 ] 

Apache Spark commented on SPARK-26081:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/23052

> Do not write empty files by text datasources
> 
>
> Key: SPARK-26081
> URL: https://issues.apache.org/jira/browse/SPARK-26081
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Text-based datasources like CSV, JSON and Text produce empty files for empty 
> partitions. This introduces additional overhead when opening and reading 
> such files back. In the current implementation of OutputWriter, the output 
> stream is created eagerly even when no records are written to it, so its 
> creation can be postponed until the first write.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688653#comment-16688653
 ] 

Apache Spark commented on SPARK-23128:
--

User 'justinuang' has created a pull request for this issue:
https://github.com/apache/spark/pull/23051

> A new approach to do adaptive execution in Spark SQL
> 
>
> Key: SPARK-23128
> URL: https://issues.apache.org/jira/browse/SPARK-23128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Carson Wang
>Priority: Major
> Attachments: AdaptiveExecutioninBaidu.pdf
>
>
> SPARK-9850 proposed the basic idea of adaptive execution in Spark. In 
> DAGScheduler, a new API is added to support submitting a single map stage.  
> The current implementation of adaptive execution in Spark SQL supports 
> changing the reducer number at runtime. An Exchange coordinator is used to 
> determine the number of post-shuffle partitions for a stage that needs to 
> fetch shuffle data from one or multiple stages. The current implementation 
> adds ExchangeCoordinator while we are adding Exchanges. However, there are 
> some limitations. First, it may cause additional shuffles that can decrease 
> performance; we can see this in the EnsureRequirements rule when it adds the 
> ExchangeCoordinator. Secondly, it is not a good idea to add 
> ExchangeCoordinators while we are adding Exchanges, because we don’t have a 
> global picture of all shuffle dependencies of a post-shuffle stage. For 
> example, for a three-table join in a single stage, the same 
> ExchangeCoordinator should be used across the three Exchanges, but currently 
> two separate ExchangeCoordinators are added. Thirdly, with the current 
> framework it is not easy to flexibly implement other adaptive execution 
> features, such as changing the execution plan or handling skewed joins at 
> runtime.
> We'd like to introduce a new way to do adaptive execution in Spark SQL and 
> address the limitations. The idea is described at 
> [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26081) Do not write empty files by text datasources

2018-11-15 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26081:
--

 Summary: Do not write empty files by text datasources
 Key: SPARK-26081
 URL: https://issues.apache.org/jira/browse/SPARK-26081
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Text-based datasources like CSV, JSON and Text produce empty files for empty 
partitions. This introduces additional overhead when opening and reading such 
files back. In the current implementation of OutputWriter, the output stream is 
created eagerly even when no records are written to it, so its creation can be 
postponed until the first write.
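
As a rough sketch of the proposed behavior (not the actual OutputWriter patch; the 
class and field names below are hypothetical), the stream can be opened lazily on 
the first write so that an empty partition never produces a file at all:

{code:scala}
import java.io.{BufferedOutputStream, FileOutputStream, OutputStream}

// Hypothetical writer: the output file is only created once the first
// record arrives, so an empty partition leaves no file behind.
class LazyTextWriter(path: String) {
  private var out: OutputStream = _

  private def stream(): OutputStream = {
    if (out == null) {
      out = new BufferedOutputStream(new FileOutputStream(path))
    }
    out
  }

  def write(record: String): Unit =
    stream().write((record + "\n").getBytes("UTF-8"))

  // Closing a writer that never received a record is a no-op.
  def close(): Unit = if (out != null) out.close()
}
{code}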



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26080) Unable to run worker.py on Windows

2018-11-15 Thread Hayden Jeune (JIRA)
Hayden Jeune created SPARK-26080:


 Summary: Unable to run worker.py on Windows
 Key: SPARK-26080
 URL: https://issues.apache.org/jira/browse/SPARK-26080
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.0
 Environment: Windows 10 Education 64 bit
Reporter: Hayden Jeune


Use of the {{resource}} module in Python means worker.py cannot run on a Windows 
system. This module is only available in Unix-based environments.
[https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25]

{code:python}
textFile = sc.textFile("README.md")
textFile.first()
{code}
When the above commands are run, I receive the error 'worker failed to connect 
back', and I can see an exception in the console coming from worker.py saying 
'ModuleNotFoundError: No module named resource'.

I do not really know enough about what I'm doing to fix this myself. Apologies 
if there's something simple I'm missing here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26079:


Assignee: Apache Spark

> Flaky test: StreamingQueryListenersConfSuite
> 
>
> Key: SPARK-26079
> URL: https://issues.apache.org/jira/browse/SPARK-26079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>Priority: Minor
>
> We've had this test fail a few times in our builds.
> {noformat}
> org.scalatest.exceptions.TestFailedException: null equaled null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> {noformat}
> You can reproduce it reliably by adding a sleep in the test listener. Fix 
> coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14220) Build and test Spark against Scala 2.12

2018-11-15 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688564#comment-16688564
 ] 

Stavros Kontopoulos edited comment on SPARK-14220 at 11/15/18 7:45 PM:
---

[~SeanShubin] That one was fixed here: 
[https://jira.apache.org/jira/browse/SPARK-22128]; if I'm not mistaken, [~srowen] 
knows more.


was (Author: skonto):
That one was fixed here: [https://jira.apache.org/jira/browse/SPARK-22128] if 
not mistaken [~srowen] knows more.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-11-15 Thread Stavros Kontopoulos (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688564#comment-16688564
 ] 

Stavros Kontopoulos commented on SPARK-14220:
-

That one was fixed here: [https://jira.apache.org/jira/browse/SPARK-22128]; if 
I'm not mistaken, [~srowen] knows more.

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Assignee: Sean Owen
>Priority: Blocker
>  Labels: release-notes
> Fix For: 2.4.0
>
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite

2018-11-15 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26079:


Assignee: (was: Apache Spark)

> Flaky test: StreamingQueryListenersConfSuite
> 
>
> Key: SPARK-26079
> URL: https://issues.apache.org/jira/browse/SPARK-26079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We've had this test fail a few times in our builds.
> {noformat}
> org.scalatest.exceptions.TestFailedException: null equaled null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> {noformat}
> You can reproduce it reliably by adding a sleep in the test listener. Fix 
> coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite

2018-11-15 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688561#comment-16688561
 ] 

Apache Spark commented on SPARK-26079:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/23050

> Flaky test: StreamingQueryListenersConfSuite
> 
>
> Key: SPARK-26079
> URL: https://issues.apache.org/jira/browse/SPARK-26079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> We've had this test fail a few times in our builds.
> {noformat}
> org.scalatest.exceptions.TestFailedException: null equaled null
>   at 
> org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45)
>   at 
> org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
> {noformat}
> You can reproduce it reliably by adding a sleep in the test listener. Fix 
> coming up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite

2018-11-15 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-26079:
--

 Summary: Flaky test: StreamingQueryListenersConfSuite
 Key: SPARK-26079
 URL: https://issues.apache.org/jira/browse/SPARK-26079
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 2.4.0
Reporter: Marcelo Vanzin


We've had this test fail a few times in our builds.

{noformat}
org.scalatest.exceptions.TestFailedException: null equaled null
  at 
org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
  at 
org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501)
  at 
org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45)
  at 
org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
{noformat}

You can reproduce it reliably by adding a sleep in the test listener. Fix 
coming up.
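
For readers unfamiliar with the suite, the sketch below is a hypothetical 
illustration of the kind of listener it registers (for example via 
{{spark.sql.streaming.streamingQueryListeners}}), with a sleep added as suggested 
above to provoke the race; the class name and delay are assumptions, not the 
actual test code.

{code:scala}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Hypothetical listener: delaying event handling lets the suite's assertion
// run before the listener has recorded the event, reproducing the failure.
class SlowTestListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    Thread.sleep(500) // artificial delay to expose the race
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {}
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
}
{code}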



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


