[jira] [Created] (SPARK-26086) Spark streaming max records per batch interval
vijayant soni created SPARK-26086: - Summary: Spark streaming max records per batch interval Key: SPARK-26086 URL: https://issues.apache.org/jira/browse/SPARK-26086 Project: Spark Issue Type: Bug Components: DStreams Affects Versions: 2.3.1 Reporter: vijayant soni We have a Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the maximum number of records that can be read in a single batch can be calculated with the formula below: {{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) = 10 * 60 * 5 * 2000 = 6,000,000}} But the actual number of records is more than that maximum: Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that the receivers are not even reading at the max rate; the records read per receiver are near 1,900 per second. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
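The arithmetic in the report can be checked in plain Python; all names below are illustrative, not Spark APIs:

```python
# Upper bound on records per batch implied by the configuration in the report.
batch_interval_minutes = 10
num_receivers = 5
max_rate = 2000  # spark.streaming.receiver.maxRate: records/sec per receiver

max_records_per_batch = batch_interval_minutes * 60 * num_receivers * max_rate
print(max_records_per_batch)  # 6000000

# Observed batch sizes from the report, and how far each exceeds the cap.
observed = [6_005_886, 6_001_623, 6_010_148]
overshoot = [n - max_records_per_batch for n in observed]
print(overshoot)  # [5886, 1623, 10148]
```

The overshoot is under 0.2% of the cap in every batch, which would be consistent with rate-limit granularity effects at block or batch boundaries rather than receivers ignoring the limit entirely; that is a guess, not a confirmed diagnosis.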
[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval
[ https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijayant soni updated SPARK-26086: -- Affects Version/s: (was: 2.3.2) 2.3.0 > Spark streaming max records per batch interval > -- > > Key: SPARK-26086 > URL: https://issues.apache.org/jira/browse/SPARK-26086 > Project: Spark > Issue Type: Bug > Components: DStreams > Affects Versions: 2.3.0 > Reporter: vijayant soni > Priority: Major
[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval
[ https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijayant soni updated SPARK-26086: -- Affects Version/s: (was: 2.3.1) 2.3.2 > Spark streaming max records per batch interval > -- > > Key: SPARK-26086 > URL: https://issues.apache.org/jira/browse/SPARK-26086 > Project: Spark > Issue Type: Bug > Components: DStreams > Affects Versions: 2.3.2 > Reporter: vijayant soni > Priority: Major
[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval
[ https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijayant soni updated SPARK-26086: -- Description: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {noformat} Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 {noformat} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver are near 1900 per second. was: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {\{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 }} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver are near 1900 per second. 
[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval
[ https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijayant soni updated SPARK-26086: -- Description: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {\{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 }} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver are near 1900 per second. was: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {\{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 }} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver per second are near 1900 per second. 
[jira] [Updated] (SPARK-26086) Spark streaming max records per batch interval
[ https://issues.apache.org/jira/browse/SPARK-26086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vijayant soni updated SPARK-26086: -- Description: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {\{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 }} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver per second are near 1900 per second. was: We have an Spark Streaming application that reads from Kinesis and writes to Redshift. *Configuration*: Number of receivers = 5 Batch interval = 10 mins spark.streaming.receiver.maxRate = 2000 (records per second) According to this config, the max records that can be read in a single batch can be calculated using below formula: {{Max records per batch = batch_interval * 60 (convert mins to seconds) * 5 (number of receivers) * 2000 (max records per second per receiver) 10 * 60 * 5 * 2000 = 6,000,000 }} But the actual number of records is more that the max number. Batch I - 6,005,886 records Batch II - 6,001,623 records Batch III - 6,010,148 records Please note that receivers are not even reading at the max rate, the records read per receiver are near 1900 per second. 
[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute
[ https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689056#comment-16689056 ] Ruslan Dautkhanov commented on SPARK-26041: --- [~hyukjin.kwon] I didn't request an investigation. I hope that creating a jira and explaining how the problem happens may help somebody else solve it too, no? If you haven't noticed, this jira has a sequence of SQLs attached as a txt file that triggers this problem. There are a couple of other jiras, SPARK-13480 and SPARK-12940, that seem relevant but were also closed as couldn't-reproduce. I think there is a long-standing problem where Catalyst over-optimizes and excessively cuts some columns from the lineage. I thought that by reporting problems here we help make Spark better, no? Unfortunately, closing a jira as can't-reproduce doesn't make the problem disappear. Having said that, I will try to make a reproducible case and upload it here, in addition to the SQLs that are already attached. > catalyst cuts out some columns from dataframes: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute > - > > Key: SPARK-26041 > URL: https://issues.apache.org/jira/browse/SPARK-26041 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core > Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 > Environment: Spark 2.3.2 > Hadoop 2.6 > When we materialize one of the intermediate dataframes as a parquet table, and > read it back in, this error doesn't happen (exact same downstream queries). > > Reporter: Ruslan Dautkhanov > Priority: Major > Labels: catalyst, optimization > Attachments: SPARK-26041.txt > > > There is a workflow with a number of group-bys, joins, `exists` and `in`s > between a set of dataframes. > We are getting the following exception; the reason is that Catalyst cuts some > columns out of dataframes: > {noformat} > Unhandled error: , An error occurred > while calling o1187.cache. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 > in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage > 2011.0 (TID 832340, pc1udatahad23, execut > or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: part_code#56012 > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318) > at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210) >
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689052#comment-16689052 ] Ruslan Dautkhanov commented on SPARK-26019: --- No, it was the only instance I had of this problem. I will ask the user who ran into this again. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.2, 2.4.0 > Reporter: Ruslan Dautkhanov > Priority: Major > > Started happening after the 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = self.rfile.read(len(auth_token)) > 
TypeError: object of type 'NoneType' has no len() > > {code} > > The error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > The error seems to be flaky - on the next rerun it didn't happen.
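The quoted failure is `len(auth_token)` being evaluated while `auth_token` is `None`. A minimal plain-Python sketch of that failure mode and an illustrative guard (`read_token` is a hypothetical helper written for this sketch, not the actual pyspark code):

```python
import io

def read_token(rfile, auth_token):
    # Mirrors the shape of the failing line in authenticate_and_accum_updates:
    #     received_token = self.rfile.read(len(auth_token))
    # If no auth token was set, auth_token is None and len(None) raises
    # "TypeError: object of type 'NoneType' has no len()".
    if auth_token is None:
        return None  # guard: nothing to authenticate against
    return rfile.read(len(auth_token))

# With a token, exactly len(auth_token) bytes are read from the socket file.
print(read_token(io.BytesIO(b"secret123"), "secret"))  # b'secret'

# Without the guard, the reported TypeError is reproduced:
try:
    io.BytesIO(b"x").read(len(None))
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```

Whether pyspark should guard, fail earlier, or never reach this state with `auth_token` unset is exactly what the ticket leaves open.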
[jira] [Resolved] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruslan Dautkhanov resolved SPARK-26019. --- Resolution: Cannot Reproduce > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 2.3.2, 2.4.0 > Reporter: Ruslan Dautkhanov > Priority: Major
[jira] [Commented] (SPARK-26078) WHERE .. IN fails to filter rows when used in combination with UNION
[ https://issues.apache.org/jira/browse/SPARK-26078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689040#comment-16689040 ] Wenchen Fan commented on SPARK-26078: - looks like a bug in how we rewrite correlated subqueries, cc [~viirya] [~mgaido] > WHERE .. IN fails to filter rows when used in combination with UNION > > > Key: SPARK-26078 > URL: https://issues.apache.org/jira/browse/SPARK-26078 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.3.1, 2.4.0 > Reporter: Arttu Voutilainen > Priority: Blocker > Labels: correctness > > Hey, > We encountered a case where Spark SQL does not seem to handle WHERE .. IN > correctly when used in combination with UNION; it also returns rows > that do not fulfill the condition. Swapping the order of the datasets in the > UNION makes the problem go away. Repro below: > > {code} > sql = SQLContext(sc) > a = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > b = spark.createDataFrame([{'id': 'a', 'num': 2}, {'id':'b', 'num':1}]) > a.registerTempTable('a') > b.registerTempTable('b') > bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'a' as source FROM a > UNION ALL > SELECT id, num, 'b' as source FROM b > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > no_bug = sql.sql(""" > SELECT id,num,source FROM > ( > SELECT id, num, 'b' as source FROM b > UNION ALL > SELECT id, num, 'a' as source FROM a > ) AS c > WHERE c.id IN (SELECT id FROM b WHERE num = 2) > """) > bug.show() > no_bug.show() > bug.explain(True) > no_bug.explain(True) > {code} > This results in one extra row in the "bug" DF coming from DF "b" that should > not be there, as its id ('b') is not returned by the subquery: > {code:java} > >>> bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| a| > | a| 2| b| > | b| 1| b| > +---+---+--+ > >>> no_bug.show() > +---+---+--+ > | id|num|source| > +---+---+--+ > | a| 2| b| > | a| 2| a| > +---+---+--+ > {code} > The reason can be seen in the query plans: > 
{code:java} > >>> bug.explain(True) > ... > == Optimized Logical Plan == > Union > :- Project [id#0, num#1L, a AS source#136] > : +- Join LeftSemi, (id#0 = id#4) > : :- LogicalRDD [id#0, num#1L], false > : +- Project [id#4] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Join LeftSemi, (id#4#172 = id#4#172) >:- Project [id#4, num#5L, b AS source#137] >: +- LogicalRDD [id#4, num#5L], false >+- Project [id#4 AS id#4#172] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > Note the line *+- Join LeftSemi, (id#4#172 = id#4#172)* - this condition > seems wrong, and I believe it causes the LeftSemi to return true for all rows > in the left-hand-side table, thus failing to filter as the WHERE .. IN > should. Compare with the non-buggy version, where both LeftSemi joins have > distinct #-things on both sides: > {code:java} > >>> no_bug.explain() > ... > == Optimized Logical Plan == > Union > :- Project [id#4, num#5L, b AS source#142] > : +- Join LeftSemi, (id#4 = id#4#173) > : :- LogicalRDD [id#4, num#5L], false > : +- Project [id#4 AS id#4#173] > :+- Filter (isnotnull(num#5L) && (num#5L = 2)) > : +- LogicalRDD [id#4, num#5L], false > +- Project [id#0, num#1L, a AS source#143] >+- Join LeftSemi, (id#0 = id#4#173) > :- LogicalRDD [id#0, num#1L], false > +- Project [id#4 AS id#4#173] > +- Filter (isnotnull(num#5L) && (num#5L = 2)) > +- LogicalRDD [id#4, num#5L], false > {code} > > Best, > -Arttu > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24255) Require Java 8 in SparkR description
[ https://issues.apache.org/jira/browse/SPARK-24255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16689038#comment-16689038 ] Shivaram Venkataraman commented on SPARK-24255: --- This is a great list -- I don't think we are able to handle all of these scenarios? [~kiszk] do you know of any existing library that parses all the version strings? > Require Java 8 in SparkR description > > > Key: SPARK-24255 > URL: https://issues.apache.org/jira/browse/SPARK-24255 > Project: Spark > Issue Type: Bug > Components: SparkR > Affects Versions: 2.3.0 > Reporter: Shivaram Venkataraman > Assignee: Shivaram Venkataraman > Priority: Major > Fix For: 2.3.1, 2.4.0 > > > CRAN checks require that the Java version be set both in the package description > and checked during runtime.
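For reference, Java version strings come in at least two schemes: the legacy `1.8.0_181` style, where the major version is the second component, and the post-Java-9 style (`9.0.4`, `11`, `10.0.2+13`), where it is the first. A sketch of that parsing in Python (the actual SparkR check is written in R; this only illustrates the cases a parser has to handle):

```python
import re

def java_major_version(version_string):
    """Extract the Java major version from a version string.

    Handles the legacy '1.x.y_zz' scheme (Java 8 and earlier) and the
    newer 'x.y.z' scheme (Java 9+). Illustrative sketch only.
    """
    m = re.match(r"(\d+)(?:\.(\d+))?", version_string)
    if not m:
        raise ValueError("unrecognized version string: %r" % version_string)
    first = int(m.group(1))
    # Legacy scheme: '1.8.0_181' -> the major version is the second component.
    if first == 1 and m.group(2) is not None:
        return int(m.group(2))
    return first

print(java_major_version("1.8.0_181"))  # 8
print(java_major_version("9.0.4"))      # 9
print(java_major_version("11"))         # 11
print(java_major_version("10.0.2+13"))  # 10
```

Even this sketch ignores vendor suffixes and early-access tags, which is presumably why the comment asks whether an existing library already covers all the variants.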
[jira] [Commented] (SPARK-20236) Overwrite a partitioned data source table should only overwrite related partitions
[ https://issues.apache.org/jira/browse/SPARK-20236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688995#comment-16688995 ] Wenchen Fan commented on SPARK-20236: - This looks like a bug to me. Can you come up with a simple code snippet to reproduce this issue and create a ticket? I'll take a closer look. Thanks! > Overwrite a partitioned data source table should only overwrite related > partitions > -- > > Key: SPARK-20236 > URL: https://issues.apache.org/jira/browse/SPARK-20236 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.2.0 > Reporter: Wenchen Fan > Assignee: Wenchen Fan > Priority: Major > Labels: releasenotes > Fix For: 2.3.0 > > > When we overwrite a partitioned data source table, currently Spark will > truncate the entire table to write new data, or truncate a bunch of > partitions according to the given static partitions. > For example, {{INSERT OVERWRITE tbl ...}} will truncate the entire table, and > {{INSERT OVERWRITE tbl PARTITION (a=1, b)}} will truncate all the partitions > that start with {{a=1}}. > This behavior is kind of reasonable, as we can know which partitions will be > overwritten before runtime. However, Hive has a different behavior: it > only overwrites related partitions, e.g. {{INSERT OVERWRITE tbl SELECT > 1,2,3}} will only overwrite partition {{a=2, b=3}}, assuming {{tbl}} has only > one data column and is partitioned by {{a}} and {{b}}. > It seems better if we can follow Hive's behavior.
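For readers landing on this ticket later: the improvement described above was shipped in Spark 2.3.0 behind the `spark.sql.sources.partitionOverwriteMode` conf. A sketch of the two behaviors in SQL (table and column names are illustrative):

```sql
CREATE TABLE tbl (data INT, a INT, b INT) USING parquet PARTITIONED BY (a, b);
INSERT INTO tbl VALUES (0, 1, 1), (0, 2, 3);

-- static (the default, and the pre-2.3 behavior described in the ticket):
-- the whole table is truncated before the write, so only a=2/b=3 survives.
SET spark.sql.sources.partitionOverwriteMode = static;
INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3;

-- dynamic: only partitions present in the incoming data are replaced,
-- matching the Hive behavior the ticket asks for; a=1/b=1 is left intact.
SET spark.sql.sources.partitionOverwriteMode = dynamic;
INSERT OVERWRITE TABLE tbl SELECT 1, 2, 3;
```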
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688997#comment-16688997 ] Sean Owen commented on SPARK-26026: --- Ah, I think it was this: {code} /Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/AlphaComponent.java:33: warning: Implementation restriction: subclassing Classfile does not make your annotation visible at runtime. If that is what you want, you must write the annotation class in Java. public @interface AlphaComponent {} ^ /Users/seanowen/Documents/spark_2.11/common/tags/src/main/java/org/apache/spark/annotation/DeveloperApi.java:36: warning: Implementation restriction: subclassing Classfile does not make your annotation visible at runtime. If that is what you want, you must write the annotation class in Java. public @interface DeveloperApi {} ^ ... {code} It may be that we have to port the annotations to make it work. > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. 
> For concrete examples: > * https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/ > * https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/ > * https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/ > * https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/ > * https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/ > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars, and it'd be nice to have these available.
[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688978#comment-16688978 ] Hyukjin Kwon commented on SPARK-26056: -- Looks like we should fix this. > java api spark streaming spark-avro ui > --- > > Key: SPARK-26056 > URL: https://issues.apache.org/jira/browse/SPARK-26056 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Web UI >Affects Versions: 2.3.2 >Reporter: wish >Priority: Major > Attachments: sql.jpg > > > when i use java api spark streaming to read kafka and save avro( databricks > spark-avro dependency) > spark ui :the SQL tabs repeat again and again > > but scala api no problem > > normal ui like this: > * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/] > * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/] > * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/] > * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/] > * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/] > * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/] > * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/] > but java api ui like this: > Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL > SQL ..SQL -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688976#comment-16688976 ] wish commented on SPARK-26056: -- [~hyukjin.kwon] done > java api spark streaming spark-avro ui > --- > > Key: SPARK-26056 > URL: https://issues.apache.org/jira/browse/SPARK-26056 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Web UI >Affects Versions: 2.3.2 >Reporter: wish >Priority: Major > Attachments: sql.jpg > > > when i use java api spark streaming to read kafka and save avro( databricks > spark-avro dependency) > spark ui :the SQL tabs repeat again and again > > but scala api no problem > > normal ui like this: > * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/] > * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/] > * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/] > * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/] > * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/] > * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/] > * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/] > but java api ui like this: > Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL > SQL ..SQL -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688973#comment-16688973 ] Sean Owen commented on SPARK-26026: --- Hm, I don't recall why I removed that now. It could have been some issue generating scaladoc artifacts with 2.12. Let me try re-enabling it to see whether there is an issue now or not. While it's not super important to publish them as artifacts, I don't think we intended to stop. > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns
[ https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688964#comment-16688964 ] Hyukjin Kwon commented on SPARK-26067: -- I don't think we should fix this in Spark; it's something Python should fix, and it is fixed in Python 3.7 in any event. > Pandas GROUPED_MAP udf breaks if DF has >255 columns > > > Key: SPARK-26067 > URL: https://issues.apache.org/jira/browse/SPARK-26067 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Abdeali Kothari >Priority: Major > > When I run spark's Pandas GROUPED_MAP udfs to apply a UDAF i wrote in > python/pandas on a grouped dataframe in spark - it fails if the number of > columns is greater than 255 in Python 3.6 and lower. > {code:java} > import pyspark > from pyspark.sql import types as T, functions as F > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame( > [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + > str(i) for i in range(256)]) > new_schema = T.StructType([ > field for field in df.schema] + [T.StructField("new_row", > T.DoubleType())]) > def myfunc(df): > df['new_row'] = 1 > return df > myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc) > df2 = df.groupBy(["a1"]).apply(myfunc_udf) > print(df2.count()) # This FAILS > # ERROR: > # Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > # func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > # mapper = eval(mapper_str, udfs) > # File "", line 1 > # SyntaxError: more than 255 arguments > {code} > Note: In Python 3.7 the 255 limit was raised, but I have not tried with > Python 3.7 > 
...https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes > I was using Python 3.5 (from anaconda), Spark 2.3.1 to reproduce this on my > Hadoop Linux cluster and also on my Mac standalone spark installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
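The root cause is CPython's 255-argument limit on function calls, which the mapper PySpark generates exceeds once the DataFrame has more than 255 columns; Python 3.7 removed that limit. A minimal guard one could apply before calling the UDF is sketched below; the helper name and constant are illustrative, not part of the PySpark API:

```python
import sys

# Python <= 3.6 rejects calls with more than 255 arguments; PySpark's
# generated mapper hits this when the DataFrame has more than 255 columns.
PY36_CALL_ARG_LIMIT = 255

def grouped_map_udf_is_safe(num_columns: int) -> bool:
    """Return True if a GROUPED_MAP UDF over this many columns should
    work on the running interpreter (a sketch, not a PySpark API)."""
    if sys.version_info >= (3, 7):
        return True  # the 255-argument limit was removed in Python 3.7
    return num_columns <= PY36_CALL_ARG_LIMIT
```

On Python 3.7 and later the guard always passes; on 3.6 and below it flags DataFrames that would trip the SyntaxError shown in the report.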
[jira] [Commented] (SPARK-26064) Unable to fetch jar from remote repo while running spark-submit on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688961#comment-16688961 ] Hyukjin Kwon commented on SPARK-26064: -- Is it a question or an issue? > Unable to fetch jar from remote repo while running spark-submit on kubernetes > - > > Key: SPARK-26064 > URL: https://issues.apache.org/jira/browse/SPARK-26064 > Project: Spark > Issue Type: Question > Components: Kubernetes >Affects Versions: 2.3.2 >Reporter: Bala Bharath Reddy Resapu >Priority: Major > > I am trying to run spark on kubernetes with a docker image. My requirement is > to download the jar from the external repo while running spark-submit. I am > able to download the jar using wget in the container but it doesn't work when > inputting in the spark-submit command. I am not packaging the jar with docker > image. It works fine when I input the jar file inside the docker image. > > ./bin/spark-submit \ > --master k8s://[https://ip:port|https://ipport/] \ > --deploy-mode cluster \ > --name test3 \ > --class hello \ > --conf spark.kubernetes.container.image.pullSecrets=abcd \ > --conf spark.kubernetes.container.image=spark:h2.0 \ > [https://devops.com/artifactory/local/testing/testing_2.11/h|https://bala.bharath.reddy.resapu%40ibm.com:akcp5bcbktykg2ti28sju4gtebsqwkg2mqkaf9w6g5rdbo3iwrwx7qb1m5dokgd54hdru2...@na.artifactory.swg-devops.com/artifactory/txo-cedp-garage-artifacts-sbt-local/testing/testing_2.11/arithmetic.jar]ello.jar -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided
[ https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688968#comment-16688968 ] Hyukjin Kwon commented on SPARK-26027: -- Oh, right. Let me keep this in mind and reopen this if it causes an actual problem. It should not be a problem for now. > Unable to build Spark for Scala 2.12 with Maven script provided > --- > > Key: SPARK-26027 > URL: https://issues.apache.org/jira/browse/SPARK-26027 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Jason Moore >Priority: Minor > > In ./build/mvn, the scala.version property from pom.xml is used to determine which Scala > library to fetch but it doesn't seem to use the value under the scala-2.12 > profile even if that is set. > The result is that the maven build still uses scala-library 2.11.12 and > compilation fails. > Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think > that only updates scala.binary.version) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26034) Break large mllib/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26034: Assignee: Bryan Cutler (was: Apache Spark) > Break large mllib/tests.py files into smaller files > --- > > Key: SPARK-26034 > URL: https://issues.apache.org/jira/browse/SPARK-26034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wish updated SPARK-26056: - Attachment: sql.jpg > java api spark streaming spark-avro ui > --- > > Key: SPARK-26056 > URL: https://issues.apache.org/jira/browse/SPARK-26056 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Web UI >Affects Versions: 2.3.2 >Reporter: wish >Priority: Major > Attachments: sql.jpg > > > when i use java api spark streaming to read kafka and save avro( databricks > spark-avro dependency) > spark ui :the SQL tabs repeat again and again > > but scala api no problem > > normal ui like this: > * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/] > * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/] > * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/] > * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/] > * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/] > * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/] > * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/] > but java api ui like this: > Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL > SQL ..SQL -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26077) Reserved SQL words are not escaped by JDBC writer for table name
[ https://issues.apache.org/jira/browse/SPARK-26077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688967#comment-16688967 ] Hyukjin Kwon commented on SPARK-26077: -- cc'ing [~maropu] FYI > Reserved SQL words are not escaped by JDBC writer for table name > > > Key: SPARK-26077 > URL: https://issues.apache.org/jira/browse/SPARK-26077 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.2 >Reporter: Eugene Golovan >Priority: Major > > This bug is similar to SPARK-16387 but this time table name is not escaped. > How to reproduce: > 1/ Start spark shell with mysql connector > spark-shell --jars ./mysql-connector-java-8.0.13.jar > > 2/ Execute next code > > import spark.implicits._ > (spark > .createDataset(Seq("a","b","c")) > .toDF("order") > .write > .format("jdbc") > .option("url", s"jdbc:mysql://root@localhost:3306/test") > .option("driver", "com.mysql.cj.jdbc.Driver") > .option("dbtable", "condition") > .save) > > , where condition - is reserved word. 
> > Error message: > > java.sql.SQLSyntaxErrorException: You have an error in your SQL syntax; check > the manual that corresponds to your MySQL server version for the right syntax > to use near 'condition (`order` TEXT )' at line 1 > at > com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120) > at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97) > at > com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) > at > com.mysql.cj.jdbc.StatementImpl.executeUpdateInternal(StatementImpl.java:1355) > at > com.mysql.cj.jdbc.StatementImpl.executeLargeUpdate(StatementImpl.java:2128) > at com.mysql.cj.jdbc.StatementImpl.executeUpdate(StatementImpl.java:1264) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:844) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:95) > at > org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:656) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:656) > at > org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:273) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:267) > ... 59 elided > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
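Note that the generated DDL in the error already quotes the column name (`order`) but not the table name (condition), which is where SPARK-16387 stopped. A dialect-level fix would quote the table identifier the same way; the helper below is an illustrative sketch of MySQL-style backtick quoting, not Spark's actual JdbcDialect code:

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling any embedded
    backticks, so reserved words like 'condition' are safe in DDL."""
    return "`" + name.replace("`", "``") + "`"

# The statement that failed in the report would then come out roughly as:
ddl = "CREATE TABLE {} (`order` TEXT)".format(quote_identifier("condition"))
```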
[jira] [Commented] (SPARK-26075) Cannot broadcast the table that is larger than 8GB : Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-26075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688966#comment-16688966 ] Hyukjin Kwon commented on SPARK-26075: -- Does this happen in Spark 2.4 as well? > Cannot broadcast the table that is larger than 8GB : Spark 2.3 > -- > > Key: SPARK-26075 > URL: https://issues.apache.org/jira/browse/SPARK-26075 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Neeraj Bhadani >Priority: Major > > I am trying to use the broadcast join but getting below error in Spark 2.3. > However, the same code is working fine in Spark 2.2 > > Upon checking the size of the dataframes its merely 50 MB and I have set the > threshold to 200 MB as well. As I mentioned above same code is working fine > in Spark 2.2 > > {{Error: "Cannot broadcast the table that is larger than 8GB". }} > However, Disabling the broadcasting is working fine. > {{'spark.sql.autoBroadcastJoinThreshold': '-1'}} > > {{Regards,}} > {{Neeraj}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26067) Pandas GROUPED_MAP udf breaks if DF has >255 columns
[ https://issues.apache.org/jira/browse/SPARK-26067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26067. -- Resolution: Not A Problem > Pandas GROUPED_MAP udf breaks if DF has >255 columns > > > Key: SPARK-26067 > URL: https://issues.apache.org/jira/browse/SPARK-26067 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Abdeali Kothari >Priority: Major > > When I run spark's Pandas GROUPED_MAP udfs to apply a UDAF i wrote in > python/pandas on a grouped dataframe in spark - it fails if the number of > columns is greater than 255 in Python 3.6 and lower. > {code:java} > import pyspark > from pyspark.sql import types as T, functions as F > spark = pyspark.sql.SparkSession.builder.getOrCreate() > df = spark.createDataFrame( > [[i for i in range(256)], [i+1 for i in range(256)]], schema=["a" + > str(i) for i in range(256)]) > new_schema = T.StructType([ > field for field in df.schema] + [T.StructField("new_row", > T.DoubleType())]) > def myfunc(df): > df['new_row'] = 1 > return df > myfunc_udf = F.pandas_udf(new_schema, F.PandasUDFType.GROUPED_MAP)(myfunc) > df2 = df.groupBy(["a1"]).apply(myfunc_udf) > print(df2.count()) # This FAILS > # ERROR: > # Caused by: org.apache.spark.api.python.PythonException: Traceback (most > recent call last): > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 219, in main > # func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, > eval_type) > # File > "/usr/local/hadoop/spark2.3.1/python/lib/pyspark.zip/pyspark/worker.py", line > 148, in read_udfs > # mapper = eval(mapper_str, udfs) > # File "", line 1 > # SyntaxError: more than 255 arguments > {code} > Note: In Python 3.7 the 255 limit was raised, but I have not tried with > Python 3.7 > ...https://docs.python.org/3.7/whatsnew/3.7.html#other-language-changes > I was using Python 3.5 (from anaconda), Spark 2.3.1 to reproduce this on 
my > Hadoop Linux cluster and also on my Mac standalone spark installation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26063) CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on unresolved object" when requested for numberedTreeString
[ https://issues.apache.org/jira/browse/SPARK-26063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26063. -- Resolution: Duplicate > CatalystDataToAvro gives "UnresolvedException: Invalid call to dataType on > unresolved object" when requested for numberedTreeString > --- > > Key: SPARK-26063 > URL: https://issues.apache.org/jira/browse/SPARK-26063 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jacek Laskowski >Priority: Major > > The following gives > {{org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > dataType on unresolved object, tree: 'id}}: > {code:java} > // ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:2.4.0 > scala> spark.version > res0: String = 2.4.0 > import org.apache.spark.sql.avro._ > val q = spark.range(1).withColumn("to_avro_id", to_avro('id)) > val logicalPlan = q.queryExecution.logical > scala> logicalPlan.expressions.drop(1).head.numberedTreeString > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > dataType on unresolved object, tree: 'id > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105) > at > org.apache.spark.sql.avro.CatalystDataToAvro.simpleString(CatalystDataToAvro.scala:56) > at > org.apache.spark.sql.catalyst.expressions.Expression.verboseString(Expression.scala:233) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:548) > at > org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:569) > at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:472) > at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:469) > at > org.apache.spark.sql.catalyst.trees.TreeNode.numberedTreeString(TreeNode.scala:483) > ... 
51 elided{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688962#comment-16688962 ] Apache Spark commented on SPARK-26034: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/23056 > Break large mllib/tests.py files into smaller files > --- > > Key: SPARK-26034 > URL: https://issues.apache.org/jira/browse/SPARK-26034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26034) Break large mllib/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688963#comment-16688963 ] Apache Spark commented on SPARK-26034: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/23056 > Break large mllib/tests.py files into smaller files > --- > > Key: SPARK-26034 > URL: https://issues.apache.org/jira/browse/SPARK-26034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Bryan Cutler >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26034) Break large mllib/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26034: Assignee: Apache Spark (was: Bryan Cutler) > Break large mllib/tests.py files into smaller files > --- > > Key: SPARK-26034 > URL: https://issues.apache.org/jira/browse/SPARK-26034 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided
[ https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957 ] Jason Moore edited comment on SPARK-26027 at 11/16/18 3:10 AM: --- I was originally going to withdraw the ticket when I discovered my actual issue, and happy for that to happen. The main concern I was left with was that the build scripts download based on the default Scala version (2.11.12 on v2.4.0 tag) rather than taking the profile flag into account). If you don't see this as an issue to worry about, close this ticket and forget all about it. was (Author: jasonmoore2k): I was originally going to withdraw the ticket when I discovered my actual issue, and happy for that to happen. The main concern I was left with was that the build scripts download based on the default Scala version (2.11.12 on v2.4.0 tag) rater than taking the profile flag into account). If you don't see this as an issue to worry about, close this ticket and forget all about it. > Unable to build Spark for Scala 2.12 with Maven script provided > --- > > Key: SPARK-26027 > URL: https://issues.apache.org/jira/browse/SPARK-26027 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Jason Moore >Priority: Minor > > In ./build/mvn, from pom.xml is used to determine which Scala > library to fetch but it doesn't seem to use the value under the scala-2.12 > profile even if that is set. > The result is that the maven build still uses scala-library 2.11.12 and > compilation fails. > Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think > that only updates scala.binary.version) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26059) Spark standalone mode, does not correctly record a failed Spark Job.
[ https://issues.apache.org/jira/browse/SPARK-26059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688958#comment-16688958 ] Hyukjin Kwon commented on SPARK-26059: -- Can you also describe the reproducer and the output (a screenshot if possible)? > Spark standalone mode, does not correctly record a failed Spark Job. > > > Key: SPARK-26059 > URL: https://issues.apache.org/jira/browse/SPARK-26059 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core >Affects Versions: 3.0.0 >Reporter: Prashant Sharma >Priority: Major > > In order to reproduce, submit a failing job to a Spark standalone master. The > status for the job is shown as FINISHED, irrespective of whether it > failed or succeeded. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided
[ https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688957#comment-16688957 ] Jason Moore commented on SPARK-26027: - I was originally going to withdraw the ticket when I discovered my actual issue, and happy for that to happen. The main concern I was left with was that the build scripts download based on the default Scala version (2.11.12 on the v2.4.0 tag) rather than taking the profile flag into account. If you don't see this as an issue to worry about, close this ticket and forget all about it. > Unable to build Spark for Scala 2.12 with Maven script provided > --- > > Key: SPARK-26027 > URL: https://issues.apache.org/jira/browse/SPARK-26027 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Jason Moore >Priority: Minor > > In ./build/mvn, the scala.version property from pom.xml is used to determine which Scala > library to fetch but it doesn't seem to use the value under the scala-2.12 > profile even if that is set. > The result is that the maven build still uses scala-library 2.11.12 and > compilation fails. > Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think > that only updates scala.binary.version) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26056) java api spark streaming spark-avro ui
[ https://issues.apache.org/jira/browse/SPARK-26056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688956#comment-16688956 ] Hyukjin Kwon commented on SPARK-26056: -- Can you upload screenshots? > java api spark streaming spark-avro ui > --- > > Key: SPARK-26056 > URL: https://issues.apache.org/jira/browse/SPARK-26056 > Project: Spark > Issue Type: Bug > Components: SQL, Structured Streaming, Web UI >Affects Versions: 2.3.2 >Reporter: wish >Priority: Major > > when i use java api spark streaming to read kafka and save avro( databricks > spark-avro dependency) > spark ui :the SQL tabs repeat again and again > > but scala api no problem > > normal ui like this: > * [Jobs|http://ebs-ali-beijing-datalake1:4044/jobs/] > * [Stages|http://ebs-ali-beijing-datalake1:4044/stages/] > * [Storage|http://ebs-ali-beijing-datalake1:4044/storage/] > * [Environment|http://ebs-ali-beijing-datalake1:4044/environment/] > * [Executors|http://ebs-ali-beijing-datalake1:4044/executors/] > * [SQL|http://ebs-ali-beijing-datalake1:4044/SQL/] > * [Streaming|http://ebs-ali-beijing-datalake1:4044/streaming/] > but java api ui like this: > Jobs Stages Storage Environment Executors SQL Streaming SQL SQL SQL SQL SQL > SQL ..SQL -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26050) Implment withColumnExpr method on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688955#comment-16688955 ] Hyukjin Kwon commented on SPARK-26050: -- It's easy to work around. Currently Spark has too many open APIs. Let's avoid adding new APIs unless they're strongly needed. > Implment withColumnExpr method on DataFrame > --- > > Key: SPARK-26050 > URL: https://issues.apache.org/jira/browse/SPARK-26050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mathew >Priority: Major > > Currently we provide some syntactic sugar in the form of df.selectExpr(), > which effectively executes as df.select(expr(), expr(), ...) > I propose we implement a df.withColumnExpr(), which behaves similarly to > df.withColumn(), except without the colName parameter, instead taking column > names from the expressions themselves. > This would stop the unfriendly paradigm of chained > .withColumn().withColumn().withColumn() expressions, as we could allow > passing as many column expressions as you want. > Similar to df.selectExpr(), we should support all of: 'column names', 'column > expressions', 'column string expressions' as inputs. > Comments are welcome. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
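The workaround being alluded to is df.selectExpr("*", "<expr> AS <name>", ...), which adds several derived columns in one pass instead of chaining .withColumn(). The proposed method would mostly need the naming rule selectExpr already applies: take the alias when one is given, otherwise the expression text itself. Below is a simplified sketch of that rule; it is illustrative only, since Spark's real SQL parser handles quoting, case, and nesting:

```python
def output_name(expr: str) -> str:
    """Derive the output column name from an SQL expression string:
    the alias after a trailing ' AS ' if present, otherwise the whole
    expression. Simplified sketch, not Spark's actual parser."""
    head, sep, alias = expr.rpartition(" AS ")
    return alias.strip() if sep else expr.strip()
```

With this rule, "price * qty AS total" names its column "total", while a bare column expression like "price" keeps its own name, mirroring how selectExpr-derived columns are labeled.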
[jira] [Resolved] (SPARK-26050) Implement withColumnExpr method on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-26050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26050. -- Resolution: Won't Fix > Implement withColumnExpr method on DataFrame > --- > > Key: SPARK-26050 > URL: https://issues.apache.org/jira/browse/SPARK-26050 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Mathew >Priority: Major > > Currently we provide some syntactic sugar in the form of df.selectExpr(), > which effectively executes as df.select(expr(), expr(), ...) > I propose we implement a df.withColumnExpr(), which behaves similarly to > df.withColumn(), except without the colName parameter, instead taking column > names from the expressions themselves. > This would stop the unfriendly paradigm of chained > .withColumn().withColumn().withColumn() expressions, as we could allow > passing as many column expressions as you want. > Similar to df.selectExpr(), we should support all of: 'column names', 'column > expressions', 'column string expressions' as inputs. > Comments are welcome. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository
[ https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26048: - Priority: Major (was: Blocker) > Flume connector for Spark 2.4 does not exist in Maven repository > > > Key: SPARK-26048 > URL: https://issues.apache.org/jira/browse/SPARK-26048 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Aki Tanaka >Priority: Major > > Flume connector for Spark 2.4 does not exist in the Maven repository. > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume] > > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink] > These packages will be removed in Spark 3. But Spark 2.4 branch still has > these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository
[ https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26048: - Target Version/s: (was: 2.4.1) > Flume connector for Spark 2.4 does not exist in Maven repository > > > Key: SPARK-26048 > URL: https://issues.apache.org/jira/browse/SPARK-26048 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Aki Tanaka >Priority: Major > > Flume connector for Spark 2.4 does not exist in the Maven repository. > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume] > > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink] > These packages will be removed in Spark 3. But Spark 2.4 branch still has > these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26048) Flume connector for Spark 2.4 does not exist in Maven repository
[ https://issues.apache.org/jira/browse/SPARK-26048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688951#comment-16688951 ] Hyukjin Kwon commented on SPARK-26048: -- Please avoid setting target versions and Critical+ priorities; those are usually reserved for committers. > Flume connector for Spark 2.4 does not exist in Maven repository > > > Key: SPARK-26048 > URL: https://issues.apache.org/jira/browse/SPARK-26048 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.4.0 >Reporter: Aki Tanaka >Priority: Blocker > > Flume connector for Spark 2.4 does not exist in the Maven repository. > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume] > > [https://mvnrepository.com/artifact/org.apache.spark/spark-streaming-flume-sink] > These packages will be removed in Spark 3. But Spark 2.4 branch still has > these packages. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26031) dataframe can't load correct after saving to local disk in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-26031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26031. -- Resolution: Invalid > dataframe can't load correct after saving to local disk in cluster mode > --- > > Key: SPARK-26031 > URL: https://issues.apache.org/jira/browse/SPARK-26031 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 > Environment: 1 spark master > 3 spark slaves > >Reporter: Bihui Jin >Priority: Major > > Firstly I saved a spark dataframe to local disk in spark cluster mode with " > df.write \ > .format('json') \ > .save('file:///root/bughunter/', mode='overwrite') > " (using interface provide by {color:#FF}pyspark{color}) > Then I load it with " > spark.read.format('json').load('file:///root/bughunter/') > " > But it faild with " org.apache.spark.sql.AnalysisException: Unable to infer > schema for JSON. It must be specified manually." > And I check every node's disk: > In master: > only the file named "_SUCCESS" exists in /root/bughunter/; > In each slave, there is a folder named "_temporary" exists in /root/bughunter/ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26047) Py4JNetworkError (on IPV6): An error occurred while trying to connect to the Java server (127.0.0.1
[ https://issues.apache.org/jira/browse/SPARK-26047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688950#comment-16688950 ] Hyukjin Kwon commented on SPARK-26047: -- Thanks for the workaround. I assume this is a Py4J issue rather than Spark's? > Py4JNetworkError (on IPV6): An error occurred while trying to connect to the > Java server (127.0.0.1 > --- > > Key: SPARK-26047 > URL: https://issues.apache.org/jira/browse/SPARK-26047 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Abdul Mateen Mohammed >Priority: Major > > On IPV6, I got the following error when pyspark is invoked: > h1. Py4JNetworkError: An error occurred while trying to connect to the Java > server (127.0.0.1...) > Whereas on IPV4, it is working fine. > I realized that the issue was due to the default address specified as 127.0.0.1 > in java_gateway.py under py4j-0.10.7-src.zip > Resolution: > I was able to fix it by replacing the entry > DEFAULT_ADDRESS = "127.0.0.1" with DEFAULT_ADDRESS = "::1" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
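The workaround described in the report amounts to a one-line edit of the Py4J source bundled with PySpark; a sketch of the change (the constant and file are as named in the report; this is a local patch applied by the reporter, not an upstream fix):

```python
# Local patch sketch for py4j/java_gateway.py (inside py4j-0.10.7-src.zip),
# as described in the report above.
DEFAULT_ADDRESS = "127.0.0.1"  # shipped default: IPv4 loopback only

# Reporter's workaround for IPv6-only hosts; the JVM side must also be
# listening on the IPv6 loopback for the connection to succeed.
DEFAULT_ADDRESS = "::1"
```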
[jira] [Updated] (SPARK-26045) Error in the spark 2.4 release package with the spark-avro_2.11 depdency
[ https://issues.apache.org/jira/browse/SPARK-26045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26045: - Target Version/s: (was: 2.4.0) > Error in the spark 2.4 release package with the spark-avro_2.11 depdency > > > Key: SPARK-26045 > URL: https://issues.apache.org/jira/browse/SPARK-26045 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 > Environment: 4.15.0-38-generic #41-Ubuntu SMP Wed Oct 10 10:59:38 UTC > 2018 x86_64 x86_64 x86_64 GNU/Linux >Reporter: Oscar garcía >Priority: Major > Original Estimate: 2h > Remaining Estimate: 2h > > Hello I have been problems with the last spark 2.4 release, the read avro > file feature does not seem to be working, I have fixed it in local building > the source code and updating the *avro-1.8.2.jar* on the *$SPARK_HOME*/jars/ > dependencies. > With the default spark 2.4 release when I try to read an avro file spark > raise the following exception. > {code:java} > spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0 > scala> spark.read.format("avro").load("file.avro") > java.lang.NoSuchMethodError: > org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:51) > at > org.apache.spark.sql.avro.SchemaConverters$.toSqlTypeHelper(SchemaConverters.scala:105 > {code} > Checksum: spark-2.4.0-bin-without-hadoop.tgz: 7670E29B 59EAE7A8 5DBC9350 > 085DD1E0 F056CA13 11365306 7A6A32E9 B607C68E A8DAA666 EF053350 008D0254 > 318B70FB DE8A8B97 6586CA19 D65BA2B3 FD7F919E > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute
[ https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26041. -- Resolution: Cannot Reproduce > catalyst cuts out some columns from dataframes: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute > - > > Key: SPARK-26041 > URL: https://issues.apache.org/jira/browse/SPARK-26041 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 > Environment: Spark 2.3.2 > Hadoop 2.6 > When we materialize one of intermediate dataframes as a parquet table, and > read it back in, this error doesn't happen (exact same downflow queries ). > >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: catalyst, optimization > Attachments: SPARK-26041.txt > > > There is a workflow with a number of group-by's, joins, `exists` and `in`s > between a set of dataframes. > We are getting following exception and the reason that the Catalyst cuts some > columns out of dataframes: > {noformat} > Unhandled error: , An error occurred > while calling o1187.cache. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 > in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage > 2011.0 (TID 832340, pc1udatahad23, execut > or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: part_code#56012 > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318) > at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:209) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at
[jira] [Commented] (SPARK-26041) catalyst cuts out some columns from dataframes: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute
[ https://issues.apache.org/jira/browse/SPARK-26041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688947#comment-16688947 ] Hyukjin Kwon commented on SPARK-26041: -- [~Tagar] no, don't request investigation here. Please narrow down and describe the details. No one can reproduce it for now except you. I am leaving this resolved until we get the proper information for this issue. > catalyst cuts out some columns from dataframes: > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute > - > > Key: SPARK-26041 > URL: https://issues.apache.org/jira/browse/SPARK-26041 > Project: Spark > Issue Type: Bug > Components: Optimizer, Spark Core >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 > Environment: Spark 2.3.2 > Hadoop 2.6 > When we materialize one of intermediate dataframes as a parquet table, and > read it back in, this error doesn't happen (exact same downflow queries ). > >Reporter: Ruslan Dautkhanov >Priority: Major > Labels: catalyst, optimization > Attachments: SPARK-26041.txt > > > There is a workflow with a number of group-by's, joins, `exists` and `in`s > between a set of dataframes. > We are getting following exception and the reason that the Catalyst cuts some > columns out of dataframes: > {noformat} > Unhandled error: , An error occurred > while calling o1187.cache. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 153 > in stage 2011.0 failed 4 times, most recent failure: Lost task 153.3 in stage > 2011.0 (TID 832340, pc1udatahad23, execut > or 153): org.apache.spark.sql.catalyst.errors.package$TreeNodeException: > Binding attribute, tree: part_code#56012 > at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:91) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272) > at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:90) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:45) > at > org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.bind(GeneratePredicate.scala:40) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1318) > at org.apache.spark.sql.execution.SparkPlan.newPredicate(SparkPlan.scala:401) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$lzycompute(BroadcastNestedLoopJoinExec.scala:87) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition(BroadcastNestedLoopJoinExec.scala:85) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4$$anonfun$6.apply(BroadcastNestedLoopJoinExec.scala:210) > at > scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38) > at > scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:46) > at scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:186) > at > org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$5$$anonfun$apply$4.apply(BroadcastNestedLoopJoinExec.scala:210) > at >
[jira] [Resolved] (SPARK-26040) CSV Row delimiters not consistent between platforms
[ https://issues.apache.org/jira/browse/SPARK-26040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26040. -- Resolution: Duplicate > CSV Row delimiters not consistent between platforms > --- > > Key: SPARK-26040 > URL: https://issues.apache.org/jira/browse/SPARK-26040 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Heath Abelson >Priority: Major > > Running a spark job on *nix platforms, only unix style row delimiters (\n) > are recognized. When running the job on windows, only windows style > delimiters are recognized (\r\n). > The result is that, when trying to read a csv generated my MS excel, on spark > running on Linux, extra characters are included in field names and field > values that are last on the line. > Ideally, the CSV parser would be able to handle the 2 different flavors of > line endings regardless of what platform the job is being run on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26040) CSV Row delimiters not consistent between platforms
[ https://issues.apache.org/jira/browse/SPARK-26040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688943#comment-16688943 ] Hyukjin Kwon commented on SPARK-26040: -- I think this is not an issue when {{multiLine}} is disabled because we delegate newline handling to the Hadoop library, which deals with both cases. The problem is when {{multiLine}} is enabled. This case is fixed in https://github.com/apache/spark/pull/22503 This should be a duplicate of SPARK-25493 > CSV Row delimiters not consistent between platforms > --- > > Key: SPARK-26040 > URL: https://issues.apache.org/jira/browse/SPARK-26040 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.3.0 >Reporter: Heath Abelson >Priority: Major > > Running a spark job on *nix platforms, only unix style row delimiters (\n) > are recognized. When running the job on Windows, only windows style > delimiters are recognized (\r\n). > The result is that, when trying to read a csv generated by MS Excel, on spark > running on Linux, extra characters are included in field names and field > values that are last on the line. > Ideally, the CSV parser would be able to handle the 2 different flavors of > line endings regardless of what platform the job is being run on. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26019) pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
[ https://issues.apache.org/jira/browse/SPARK-26019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688934#comment-16688934 ] Hyukjin Kwon commented on SPARK-26019: -- Are you able to make a simple reproducer? If it's about flakiness, we should be able to reproduce it when it's executed multiple times. > pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" > in authenticate_and_accum_updates() > > > Key: SPARK-26019 > URL: https://issues.apache.org/jira/browse/SPARK-26019 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.2, 2.4.0 >Reporter: Ruslan Dautkhanov >Priority: Major > > Started happening after 2.3.1 -> 2.3.2 upgrade. > > {code:python} > Exception happened during processing of request from ('127.0.0.1', 43418) > > Traceback (most recent call last): > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 290, in _handle_request_noblock > self.process_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 318, in process_request > self.finish_request(request, client_address) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 331, in finish_request > self.RequestHandlerClass(request, client_address, self) > File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line > 652, in __init__ > self.handle() > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 263, in handle > poll(authenticate_and_accum_updates) > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 238, in poll > if func(): > File > "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", > line 251, in authenticate_and_accum_updates > received_token = 
self.rfile.read(len(auth_token)) > TypeError: object of type 'NoneType' has no len() > > {code} > > Error happens here: > https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254 > The PySpark code was just running a simple pipeline of > binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. ) > and then converting it to a dataframe and running a count on it. > It seems error is flaky - on next rerun it didn't happen. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26016. -- Resolution: Invalid > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Priority: Major > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in a ISO-8859-1 encoded file and simply write out what was read; > the contents in the part file matches what was read. Which is great. > However, the second I use a map / mapPartitions function it looks like the > encoding is not correct. In addition a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26039) Reading an empty folder as ORC causes an Analysis Exception
[ https://issues.apache.org/jira/browse/SPARK-26039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688941#comment-16688941 ] Hyukjin Kwon commented on SPARK-26039: -- Does this happen in Spark 2.4 as well? > Reading an empty folder as ORC causes an Analysis Exception > --- > > Key: SPARK-26039 > URL: https://issues.apache.org/jira/browse/SPARK-26039 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.0 >Reporter: Abhishek Verma >Priority: Minor > > > > {\{val df = spark.read.format("orc").load(orcEmptyFolderPath) }} > > {{org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It > must be specified manually.; at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:185) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:184) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178) ... 49 > elided}} > > {{try > { spark.read.format("orc").load(path) } > catch { case ex: org.apache.spark.sql.AnalysisException => > { null } > }}} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26031) dataframe can't load correct after saving to local disk in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-26031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688939#comment-16688939 ] Hyukjin Kwon commented on SPARK-26031: -- That's because you're using {{file://...}} in a cluster. The file system should usually be a distributed file system that all nodes can access. > dataframe can't load correct after saving to local disk in cluster mode > --- > > Key: SPARK-26031 > URL: https://issues.apache.org/jira/browse/SPARK-26031 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 > Environment: 1 spark master > 3 spark slaves > >Reporter: Bihui Jin >Priority: Major > > Firstly I saved a spark dataframe to local disk in spark cluster mode with " > df.write \ > .format('json') \ > .save('file:///root/bughunter/', mode='overwrite') > " (using interface provide by {color:#FF}pyspark{color}) > Then I load it with " > spark.read.format('json').load('file:///root/bughunter/') > " > But it faild with " org.apache.spark.sql.AnalysisException: Unable to infer > schema for JSON. It must be specified manually." > And I check every node's disk: > In master: > only the file named "_SUCCESS" exists in /root/bughunter/; > In each slave, there is a folder named "_temporary" exists in /root/bughunter/ > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided
[ https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688938#comment-16688938 ] Hyukjin Kwon commented on SPARK-26027: -- Is the goal to use Scala 2.12.6? The default is now changed to 2.12.6 as of https://github.com/apache/spark/commit/ad853c56788fd32e035369d1fe3d96aaf6c4ef16. This issue is obsolete, so it is better to leave it resolved. > Unable to build Spark for Scala 2.12 with Maven script provided > --- > > Key: SPARK-26027 > URL: https://issues.apache.org/jira/browse/SPARK-26027 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Jason Moore >Priority: Minor > > In ./build/mvn, from pom.xml is used to determine which Scala > library to fetch but it doesn't seem to use the value under the scala-2.12 > profile even if that is set. > The result is that the maven build still uses scala-library 2.11.12 and > compilation fails. > Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think > that only updates scala.binary.version) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26027) Unable to build Spark for Scala 2.12 with Maven script provided
[ https://issues.apache.org/jira/browse/SPARK-26027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26027. -- Resolution: Not A Problem > Unable to build Spark for Scala 2.12 with Maven script provided > --- > > Key: SPARK-26027 > URL: https://issues.apache.org/jira/browse/SPARK-26027 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Jason Moore >Priority: Minor > > In ./build/mvn, from pom.xml is used to determine which Scala > library to fetch but it doesn't seem to use the value under the scala-2.12 > profile even if that is set. > The result is that the maven build still uses scala-library 2.11.12 and > compilation fails. > Am I missing a step? (I do run ./dev/change-scala-version.sh 2.12 but I think > that only updates scala.binary.version) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26026) Published Scaladoc jars missing from Maven Central
[ https://issues.apache.org/jira/browse/SPARK-26026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688937#comment-16688937 ] Hyukjin Kwon commented on SPARK-26026: -- cc [~srowen] > Published Scaladoc jars missing from Maven Central > -- > > Key: SPARK-26026 > URL: https://issues.apache.org/jira/browse/SPARK-26026 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0 >Reporter: Long Cao >Priority: Minor > > For 2.3.x and beyond, it appears that published *-javadoc.jars are missing. > For concrete examples: > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.1/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.2/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > * [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.4.0/] > * > [https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.12/2.4.0/|https://repo1.maven.org/maven2/org/apache/spark/spark-core_2.11/2.3.0/] > After some searching, I'm venturing a guess that [this > commit|https://github.com/apache/spark/commit/12ab7f7e89ec9e102859ab3b710815d3058a2e8d#diff-600376dffeb79835ede4a0b285078036L2033] > removed packaging Scaladoc with the rest of the distribution. > I don't think it's a huge problem since the versioned Scaladocs are hosted on > apache.org, but I use an external documentation/search tool > ([Dash|https://kapeli.com/dash]) that operates by looking up published > javadoc jars and it'd be nice to have these available. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26016) Encoding not working when using a map / mapPartitions call
[ https://issues.apache.org/jira/browse/SPARK-26016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688932#comment-16688932 ] Hyukjin Kwon commented on SPARK-26016: -- Let's avoid asking for investigation in JIRA; that is better suited to the mailing list. Let's discuss this on the mailing list first and file a bug here once we're clear that it is one. Let me leave this resolved. > Encoding not working when using a map / mapPartitions call > -- > > Key: SPARK-26016 > URL: https://issues.apache.org/jira/browse/SPARK-26016 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Chris Caspanello >Priority: Major > Attachments: spark-sandbox.zip > > > Attached you will find a project with unit tests showing the issue at hand. > If I read in an ISO-8859-1 encoded file and simply write out what was read, > the contents of the part file match what was read, which is great. > However, the second I use a map / mapPartitions function, it looks like the > encoding is not correct. In addition, a simple collectAsList and writing that > list of strings to a file does not work either. I don't think I'm doing > anything wrong. Can someone please investigate? I think this is a bug. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25992) Accumulators giving KeyError in pyspark
[ https://issues.apache.org/jira/browse/SPARK-25992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688931#comment-16688931 ] Hyukjin Kwon commented on SPARK-25992: -- Sounds unclear if it's an issue within Spark or not. Would you be interested in continuing investigation? > Accumulators giving KeyError in pyspark > --- > > Key: SPARK-25992 > URL: https://issues.apache.org/jira/browse/SPARK-25992 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 >Reporter: Abdeali Kothari >Priority: Major > > I am using accumulators and when I run my code, I sometimes get some warn > messages. When I checked, there was nothing accumulated - not sure if I lost > info from the accumulator or it worked and I can ignore this error ? > The message: > {noformat} > Exception happened during processing of request from > ('127.0.0.1', 62099) > Traceback (most recent call last): > File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 317, in > _handle_request_noblock > self.process_request(request, client_address) > File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 348, in > process_request > self.finish_request(request, client_address) > File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 361, in > finish_request > self.RequestHandlerClass(request, client_address, self) > File "/Users/abdealijk/anaconda3/lib/python3.6/socketserver.py", line 696, in > __init__ > self.handle() > File "/usr/local/hadoop/spark2.3.1/python/pyspark/accumulators.py", line 238, > in handle > _accumulatorRegistry[aid] += update > KeyError: 0 > > 2018-11-09 19:09:08 ERROR DAGScheduler:91 - Failed to update accumulators for > task 0 > org.apache.spark.SparkException: EOF reached before Python server acknowledged > at > org.apache.spark.api.python.PythonAccumulatorV2.merge(PythonRDD.scala:634) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1131) > at > 
org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1123) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:1123) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1206) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26080: - Target Version/s: 2.4.1, 3.0.0 (was: 3.0.0) > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Blocker > > Use of the resource module in Python means worker.py cannot run on a Windows > system; that module is only available in Unix-based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run, I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource'. > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
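One way to remove a hard dependency on the Unix-only `resource` module is a guarded import, so the file still loads where the module is absent. This is a hedged sketch of that idea only, not the actual worker.py change; `open_files_soft_limit` is an invented helper name.

```python
# Guard the Unix-only import so the module also loads on Windows,
# where `import resource` raises ModuleNotFoundError.
try:
    import resource  # present on Unix, absent on Windows
    HAS_RESOURCE = True
except ImportError:
    resource = None
    HAS_RESOURCE = False

def open_files_soft_limit():
    """Return the soft RLIMIT_NOFILE value, or None where `resource` is missing."""
    if not HAS_RESOURCE:
        return None
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Callers then branch on the `None` sentinel instead of crashing at import time on non-Unix platforms.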
[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26080: Assignee: (was: Apache Spark) > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Blocker > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26080: Assignee: Apache Spark > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Assignee: Apache Spark >Priority: Blocker > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688904#comment-16688904 ] Apache Spark commented on SPARK-26080: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23055 > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Blocker > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too
[ https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688900#comment-16688900 ] Apache Spark commented on SPARK-26085: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/23054 > Key attribute of primitive type under typed aggregation should be named as > "key" too > > > Key: SPARK-26085 > URL: https://issues.apache.org/jira/browse/SPARK-26085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > When doing typed aggregation on a Dataset, for complex key type, the key > attribute is named as "key". But for primitive type, the key attribute is > named as "value". This key attribute should also be named as "key" for > primitive type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too
[ https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688902#comment-16688902 ] Apache Spark commented on SPARK-26085: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/23054 > Key attribute of primitive type under typed aggregation should be named as > "key" too > > > Key: SPARK-26085 > URL: https://issues.apache.org/jira/browse/SPARK-26085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > When doing typed aggregation on a Dataset, for complex key type, the key > attribute is named as "key". But for primitive type, the key attribute is > named as "value". This key attribute should also be named as "key" for > primitive type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too
[ https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26085: Assignee: Apache Spark > Key attribute of primitive type under typed aggregation should be named as > "key" too > > > Key: SPARK-26085 > URL: https://issues.apache.org/jira/browse/SPARK-26085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Major > > When doing typed aggregation on a Dataset, for complex key type, the key > attribute is named as "key". But for primitive type, the key attribute is > named as "value". This key attribute should also be named as "key" for > primitive type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too
[ https://issues.apache.org/jira/browse/SPARK-26085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26085: Assignee: (was: Apache Spark) > Key attribute of primitive type under typed aggregation should be named as > "key" too > > > Key: SPARK-26085 > URL: https://issues.apache.org/jira/browse/SPARK-26085 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Liang-Chi Hsieh >Priority: Major > > When doing typed aggregation on a Dataset, for complex key type, the key > attribute is named as "key". But for primitive type, the key attribute is > named as "value". This key attribute should also be named as "key" for > primitive type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26085) Key attribute of primitive type under typed aggregation should be named as "key" too
Liang-Chi Hsieh created SPARK-26085: --- Summary: Key attribute of primitive type under typed aggregation should be named as "key" too Key: SPARK-26085 URL: https://issues.apache.org/jira/browse/SPARK-26085 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Liang-Chi Hsieh When doing typed aggregation on a Dataset, for complex key type, the key attribute is named as "key". But for primitive type, the key attribute is named as "value". This key attribute should also be named as "key" for primitive type. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26080: - Priority: Blocker (was: Major) > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Blocker > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26080: - Target Version/s: 3.0.0 > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Blocker > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1665#comment-1665 ] Hyukjin Kwon commented on SPARK-26080: -- We should fix this. > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Priority: Major > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688869#comment-16688869 ] Nagaram Prasad Addepally commented on SPARK-25957: -- Posted PR [https://github.com/apache/spark/pull/23053] for this ticket. > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
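The proposed skip boils down to a small predicate over the distribution directory. A hedged Python sketch follows; using an R/lib directory as the marker of R support is an assumption for illustration, not necessarily what the actual docker-image-tool.sh change checks.

```python
import os
import tempfile

def should_build_spark_r_image(dist_dir):
    """Return True only when the Spark distribution appears to include R support.

    Presence of an R/lib directory is an illustrative heuristic.
    """
    return os.path.isdir(os.path.join(dist_dir, "R", "lib"))

# Demo against a throwaway directory:
demo = tempfile.mkdtemp()
without_r = should_build_spark_r_image(demo)   # no R support yet
os.makedirs(os.path.join(demo, "R", "lib"))
with_r = should_build_spark_r_image(demo)      # R support present
```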
[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688871#comment-16688871 ] Apache Spark commented on SPARK-25957: -- User 'ramaddepally' has created a pull request for this issue: https://github.com/apache/spark/pull/23053 > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688870#comment-16688870 ] Apache Spark commented on SPARK-25957: -- User 'ramaddepally' has created a pull request for this issue: https://github.com/apache/spark/pull/23053 > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25957: Assignee: Apache Spark > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Assignee: Apache Spark >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25957) Skip building spark-r docker image if spark distribution does not have R support
[ https://issues.apache.org/jira/browse/SPARK-25957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25957: Assignee: (was: Apache Spark) > Skip building spark-r docker image if spark distribution does not have R > support > > > Key: SPARK-25957 > URL: https://issues.apache.org/jira/browse/SPARK-25957 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Nagaram Prasad Addepally >Priority: Major > > [docker-image-tool.sh|https://github.com/apache/spark/blob/master/bin/docker-image-tool.sh] > script by default tries to build spark-r image. We may not always build > spark distribution with R support. It would be good to skip building and > publishing spark-r images if R support is not available in the spark > distribution. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25883) Override method `prettyName` in `from_avro`/`to_avro`
[ https://issues.apache.org/jira/browse/SPARK-25883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-25883: - Fix Version/s: 2.4.1 > Override method `prettyName` in `from_avro`/`to_avro` > - > > Key: SPARK-25883 > URL: https://issues.apache.org/jira/browse/SPARK-25883 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > Fix For: 2.4.1, 3.0.0 > > > Previously in from_avro/to_avro, we override the method `simpleString` and > `sql` for the string output. However, the override only affects the alias > naming: > ``` > Project [from_avro('col, > ... > , (mode,PERMISSIVE)) AS from_avro(col, struct, > Map(mode -> PERMISSIVE))#11] > ``` > It only makes the alias name quite long. > We should follow `from_csv`/`from_json` here, to override the method > prettyName only, and we will get a clean alias name > ``` > ... AS from_avro(col)#11 > ``` -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26035) Break large streaming/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26035. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23034 [https://github.com/apache/spark/pull/23034] > Break large streaming/tests.py files into smaller files > --- > > Key: SPARK-26035 > URL: https://issues.apache.org/jira/browse/SPARK-26035 > Project: Spark > Issue Type: Sub-task > Components: DStreams, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26035) Break large streaming/tests.py files into smaller files
[ https://issues.apache.org/jira/browse/SPARK-26035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26035: Assignee: Hyukjin Kwon > Break large streaming/tests.py files into smaller files > --- > > Key: SPARK-26035 > URL: https://issues.apache.org/jira/browse/SPARK-26035 > Project: Spark > Issue Type: Sub-task > Components: DStreams, PySpark >Affects Versions: 2.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
Simeon Simeonov created SPARK-26084: --- Summary: AggregateExpression.references fails on unresolved expression trees Key: SPARK-26084 URL: https://issues.apache.org/jira/browse/SPARK-26084 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Simeon Simeonov [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a stable ordering in {{AttributeSet.toSeq}} using expression IDs ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128]) without noticing that {{AggregateExpression.references}} used {{AttributeSet.toSeq}} as a shortcut ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]). The net result is that {{AggregateExpression.references}} fails for unresolved aggregate functions. {code:scala} org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression( org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr), mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete, isDistinct = false ).references {code} fails with {code:scala} org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to exprId on unresolved object, tree: 'y at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104) at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128) at org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128) at scala.math.Ordering$$anon$5.compare(Ordering.scala:122) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:220) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) at scala.collection.AbstractSeq.sorted(Seq.scala:41) at 
scala.collection.SeqLike$class.sortBy(SeqLike.scala:623) at scala.collection.AbstractSeq.sortBy(Seq.scala:41) at org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201) {code} The solution is to avoid calling {{toSeq}} as ordering is not important in {{references}} and simplify (and speed up) the implementation to something like {code:scala} mode match { case Partial | Complete => aggregateFunction.references case PartialMerge | Final => AttributeSet(aggregateFunction.aggBufferAttributes) } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
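The failure mode described above (an eager sort whose key function raises on unresolved elements, even though the caller never needed an ordering) is language-independent and can be mimicked outside Spark. A hedged Python analogue, with illustrative class names rather than Catalyst's:

```python
# Sorting by a key that raises for unresolved elements fails, mirroring
# AttributeSet.toSeq calling exprId inside sortBy.
class UnresolvedAttribute:
    @property
    def expr_id(self):
        raise RuntimeError("Invalid call to exprId on unresolved object")

class ResolvedAttribute:
    def __init__(self, expr_id):
        self.expr_id = expr_id

attrs = [ResolvedAttribute(2), UnresolvedAttribute(), ResolvedAttribute(1)]

try:
    sorted(attrs, key=lambda a: a.expr_id)  # the toSeq-style sort blows up
    sort_failed = False
except RuntimeError:
    sort_failed = True

# A set-valued result needs no ordering, so it never touches expr_id:
references = frozenset(attrs)
```

This is why the proposed fix, which returns an attribute set without going through `toSeq`, sidesteps the exception entirely.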
[jira] [Commented] (SPARK-26084) AggregateExpression.references fails on unresolved expression trees
[ https://issues.apache.org/jira/browse/SPARK-26084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688760#comment-16688760 ] Simeon Simeonov commented on SPARK-26084: - /cc [~maropu] [~hvanhovell] who worked on the PR that may have caused this problem > AggregateExpression.references fails on unresolved expression trees > --- > > Key: SPARK-26084 > URL: https://issues.apache.org/jira/browse/SPARK-26084 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: aggregate, regression, sql > > [SPARK-18394|https://issues.apache.org/jira/browse/SPARK-18394] introduced a > stable ordering in {{AttributeSet.toSeq}} using expression IDs > ([PR-18959|https://github.com/apache/spark/pull/18959/files#diff-75576f0ec7f9d8b5032000245217d233R128]) > without noticing that {{AggregateExpression.references}} used > {{AttributeSet.toSeq}} as a shortcut > ([link|https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L132]). > The net result is that {{AggregateExpression.references}} fails for > unresolved aggregate functions. 
> {code:scala} > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression( > org.apache.spark.sql.catalyst.expressions.aggregate.Sum(('x + 'y).expr), > mode = org.apache.spark.sql.catalyst.expressions.aggregate.Complete, > isDistinct = false > ).references > {code} > fails with > {code:scala} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > exprId on unresolved object, tree: 'y > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.exprId(unresolved.scala:104) > at > org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128) > at > org.apache.spark.sql.catalyst.expressions.AttributeSet$$anonfun$toSeq$2.apply(AttributeSet.scala:128) > at scala.math.Ordering$$anon$5.compare(Ordering.scala:122) > at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) > at java.util.TimSort.sort(TimSort.java:220) > at java.util.Arrays.sort(Arrays.java:1438) > at scala.collection.SeqLike$class.sorted(SeqLike.scala:648) > at scala.collection.AbstractSeq.sorted(Seq.scala:41) > at scala.collection.SeqLike$class.sortBy(SeqLike.scala:623) > at scala.collection.AbstractSeq.sortBy(Seq.scala:41) > at > org.apache.spark.sql.catalyst.expressions.AttributeSet.toSeq(AttributeSet.scala:128) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.references(interfaces.scala:201) > {code} > The solution is to avoid calling {{toSeq}} as ordering is not important in > {{references}} and simplify (and speed up) the implementation to something > like > {code:scala} > mode match { > case Partial | Complete => aggregateFunction.references > case PartialMerge | Final => > AttributeSet(aggregateFunction.aggBufferAttributes) > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
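The underlying problem and the proposed fix are easy to model outside Catalyst. In the Java sketch below (hypothetical names; plain strings stand in for attributes, with a leading quote marking an unresolved one, echoing Catalyst's {{'x}} notation), sorting forces a per-attribute ID lookup that throws for unresolved entries, while returning the unordered set never touches the IDs:

```java
import java.util.*;

public class References {
    // Stand-in for Attribute.exprId: throws for "unresolved" attributes,
    // which are marked here with a leading quote.
    static String exprId(String attr) {
        if (attr.startsWith("'"))
            throw new IllegalStateException("Invalid call to exprId on unresolved object: " + attr);
        return attr;
    }

    // Model of the bug: AttributeSet.toSeq sorts by expression ID, so it
    // consults exprId for every attribute and fails on unresolved ones.
    static List<String> sortedRefs(Set<String> attrs) {
        List<String> out = new ArrayList<>(attrs);
        out.sort((a, b) -> exprId(a).compareTo(exprId(b)));
        return out;
    }

    // Model of the fix: references needs no ordering, so return the set
    // as-is and never consult per-attribute IDs.
    static Set<String> refs(Set<String> attrs) {
        return Collections.unmodifiableSet(attrs);
    }
}
```

The same reasoning applies to the Scala fix quoted above: dropping the {{toSeq}} call both avoids the {{exprId}} lookups and skips an unnecessary sort.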
[jira] [Commented] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688743#comment-16688743 ] Apache Spark commented on SPARK-26083: -- User 'AzureQ' has created a pull request for this issue: https://github.com/apache/spark/pull/23037 > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 2.4.1 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes (image built unchanged from the pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}
[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26083: Assignee: Apache Spark > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Assignee: Apache Spark >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 2.4.1 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes (image built unchanged from the pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}
[jira] [Assigned] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26083: Assignee: (was: Apache Spark) > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 2.4.1 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes (image built unchanged from the pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}
[jira] [Updated] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Shao updated SPARK-26083: Description: When I try to run {code:java} ./bin/pyspark{code} in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error: {code:java} $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'{code} This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} was: When I try to run {code:java} ./bin/pyspark{code} in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error: {code:java} $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'{code} This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 2.4.1 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. > Could not open PYTHONSTARTUP > IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
[ https://issues.apache.org/jira/browse/SPARK-26083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qi Shao updated SPARK-26083: Description: When I try to run {code:java} ./bin/pyspark{code} in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error: {code:java} $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'{code} This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} was: When I try to run {{}} {code:java} ./bin/pyspark{code} {{}}in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error: {code:java} $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'{code} This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} > Pyspark command is not working properly with default Docker Image build > --- > > Key: SPARK-26083 > URL: https://issues.apache.org/jira/browse/SPARK-26083 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Qi Shao >Priority: Minor > Labels: easyfix, newbie, patch, pull-request-available > Fix For: 2.4.1 > > > When I try to run > {code:java} > ./bin/pyspark{code} > in a pod in Kubernetes(image built without change from pyspark Dockerfile), > I'm getting an error: > {code:java} > $SPARK_HOME/bin/pyspark --deploy-mode client --master > k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... > Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type > "help", "copyright", "credits" or "license" for more information. Could not > open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: > '/opt/spark/python/pyspark/shell.py'{code} > This is because {{pyspark}} folder doesn't exist under {{/opt/spark/python/}} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26083) Pyspark command is not working properly with default Docker Image build
Qi Shao created SPARK-26083: --- Summary: Pyspark command is not working properly with default Docker Image build Key: SPARK-26083 URL: https://issues.apache.org/jira/browse/SPARK-26083 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 2.4.0 Reporter: Qi Shao Fix For: 2.4.1 When I try to run {code:java} ./bin/pyspark{code} in a pod in Kubernetes (image built unchanged from the pyspark Dockerfile), I'm getting an error: {code:java} $SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ... Python 2.7.15 (default, Aug 22 2018, 13:24:18) [GCC 6.4.0] on linux2 Type "help", "copyright", "credits" or "license" for more information. Could not open PYTHONSTARTUP IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'{code} This is because the {{pyspark}} folder doesn't exist under {{/opt/spark/python/}}
[jira] [Updated] (SPARK-25035) Replicating disk-stored blocks should avoid memory mapping
[ https://issues.apache.org/jira/browse/SPARK-25035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid updated SPARK-25035: - Labels: memory-analysis (was: ) > Replicating disk-stored blocks should avoid memory mapping > -- > > Key: SPARK-25035 > URL: https://issues.apache.org/jira/browse/SPARK-25035 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.1 >Reporter: Imran Rashid >Priority: Major > Labels: memory-analysis > > This is a follow-up to SPARK-24296. > When replicating a disk-cached block, even if we fetch-to-disk, we still > memory-map the file, just to copy it to another location. > Ideally we'd just move the tmp file to the right location. But even without > that, we could read the file as an input stream, instead of memory-mapping > the whole thing. Memory-mapping is a particular problem when running under > YARN, as the OS may believe there is plenty of memory available, while YARN > decides to kill the process for exceeding memory limits.
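The alternative described in the ticket — read the file as an input stream instead of memory-mapping the whole thing — can be sketched in plain Java (a hypothetical helper, not Spark's actual replication code). A buffered stream copy keeps only a small fixed-size buffer resident rather than mapping the entire block into the address space:

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class StreamCopy {
    // Copy a disk-stored block via a buffered input stream. Unlike mmap,
    // resident memory stays bounded by the buffer size (8 KiB here), which
    // matters when a resource manager enforces per-process memory limits.
    static long copyViaStream(Path src, Path dst) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(src));
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(dst))) {
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                total += n;
            }
            return total;
        }
    }
}
```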
[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
[ https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Loncaric updated SPARK-26082: Description: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable overhead network traffic and startup time when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. 
SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update {{MesosClusterScheduler.scala}} to use {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line change). was: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable overhead network traffic and startup time when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. 
SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line change). > Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler > --- > > Key: SPARK-26082 > URL: https://issues.apache.org/jira/browse/SPARK-26082 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: Martin Loncaric >Priority: Major > > Currently in > [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: > {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs > (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the > Mesos Fetcher Cache > {quote} > Currently in {{MesosClusterScheduler.scala}} (which passes parameter to > driver): > {{private val useFetchCache = > conf.getBoolean("spark.mesos.fetchCache.enable", false)}} > Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos > caching parameter to
[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
[ https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Loncaric updated SPARK-26082: Description: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable overhead network traffic and startup time when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward (literally a one-line change). 
was: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable overhead network traffic and startup time when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward. 
> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler > --- > > Key: SPARK-26082 > URL: https://issues.apache.org/jira/browse/SPARK-26082 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: Martin Loncaric >Priority: Major > > Currently in > [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: > {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs > (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the > Mesos Fetcher Cache > {quote} > Currently in {{MesosClusterScheduler.scala}} (which passes parameter to > driver): > {{private val useFetchCache = > conf.getBoolean("spark.mesos.fetchCache.enable", false)}} > Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos > caching parameter to executors): > {{private val useFetcherCache = >
[jira] [Updated] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
[ https://issues.apache.org/jira/browse/SPARK-26082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Martin Loncaric updated SPARK-26082: Description: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable overhead network traffic and startup time when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward. 
was: Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable network traffic when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward. 
> Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler > --- > > Key: SPARK-26082 > URL: https://issues.apache.org/jira/browse/SPARK-26082 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, > 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2 >Reporter: Martin Loncaric >Priority: Major > > Currently in > [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: > {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs > (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the > Mesos Fetcher Cache > {quote} > Currently in {{MesosClusterScheduler.scala}} (which passes parameter to > driver): > {{private val useFetchCache = > conf.getBoolean("spark.mesos.fetchCache.enable", false)}} > Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos > caching parameter to executors): > {{private val useFetcherCache = > conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} > This naming
[jira] [Created] (SPARK-26082) Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler
Martin Loncaric created SPARK-26082: --- Summary: Misnaming of spark.mesos.fetch(er)Cache.enable in MesosClusterScheduler Key: SPARK-26082 URL: https://issues.apache.org/jira/browse/SPARK-26082 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 2.3.2, 2.3.1, 2.3.0, 2.2.2, 2.2.1, 2.2.0, 2.1.3, 2.1.2, 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0 Reporter: Martin Loncaric Currently in [docs|https://spark.apache.org/docs/latest/running-on-mesos.html]: {quote}spark.mesos.fetcherCache.enable / false / If set to `true`, all URIs (example: `spark.executor.uri`, `spark.mesos.uris`) will be cached by the Mesos Fetcher Cache {quote} Currently in {{MesosClusterScheduler.scala}} (which passes parameter to driver): {{private val useFetchCache = conf.getBoolean("spark.mesos.fetchCache.enable", false)}} Currently in {{MesosCourseGrainedSchedulerBackend.scala}} (which passes mesos caching parameter to executors): {{private val useFetcherCache = conf.getBoolean("spark.mesos.fetcherCache.enable", false)}} This naming discrepancy dates back to version 2.0.0 ([jira|http://mail-archives.apache.org/mod_mbox/spark-issues/201606.mbox/%3cjira.12979909.1466099309000.9921.1466101026...@atlassian.jira%3E]). This means that when {{spark.mesos.fetcherCache.enable=true}} is specified, the Mesos cache will be used only for executors, and not for drivers. IMPACT: Not caching these driver files (typically including at least spark binaries, custom jar, and additional dependencies) adds considerable network traffic when frequently running spark Applications on a Mesos cluster. Additionally, since extracted files like {{spark-x.x.x-bin-*.tgz}} are additionally copied and left in the sandbox with the cache off (rather than extracted directly without an extra copy), this can considerably increase disk usage. Users CAN currently workaround by specifying the {{spark.mesos.fetchCache.enable}} option, but this should at least be specified in the documentation. 
SUGGESTED FIX: Add {{spark.mesos.fetchCache.enable}} to the documentation for versions 2 - 2.4, and update to {{spark.mesos.fetcherCache.enable}} going forward. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
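Until the two spellings are unified, the workaround the report describes amounts to setting both config keys so the driver path and the executor path each see caching enabled. A minimal sketch of that idea (a plain dict stands in for the actual SparkConf/submission setup, which varies by deployment):

```python
# Hedged sketch: set BOTH spellings of the Mesos fetcher-cache flag, so the
# driver-side reader (spark.mesos.fetchCache.enable in MesosClusterScheduler)
# and the executor-side reader (spark.mesos.fetcherCache.enable in
# MesosCoarseGrainedSchedulerBackend) both enable caching.
def with_mesos_fetch_cache(conf: dict) -> dict:
    patched = dict(conf)                                  # do not mutate caller's conf
    patched["spark.mesos.fetchCache.enable"] = "true"     # key read for the driver
    patched["spark.mesos.fetcherCache.enable"] = "true"   # key read for executors
    return patched

conf = with_mesos_fetch_cache({"spark.app.name": "demo"})
```

Setting both keys is harmless on versions where only one is read, which is why it works across the affected releases.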
[jira] [Assigned] (SPARK-26081) Do not write empty files by text datasources
[ https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26081: Assignee: (was: Apache Spark) > Do not write empty files by text datasources > > > Key: SPARK-26081 > URL: https://issues.apache.org/jira/browse/SPARK-26081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Text-based datasources like CSV, JSON and Text produce empty files for empty > partitions. This introduces additional overhead when opening and reading > such files back. In the current implementation of OutputWriter, the output stream > is created eagerly even if no records are written to it. So, its creation > can be postponed until the first write. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26081) Do not write empty files by text datasources
[ https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26081: Assignee: Apache Spark > Do not write empty files by text datasources > > > Key: SPARK-26081 > URL: https://issues.apache.org/jira/browse/SPARK-26081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Text-based datasources like CSV, JSON and Text produce empty files for empty > partitions. This introduces additional overhead when opening and reading > such files back. In the current implementation of OutputWriter, the output stream > is created eagerly even if no records are written to it. So, its creation > can be postponed until the first write. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26081) Do not write empty files by text datasources
[ https://issues.apache.org/jira/browse/SPARK-26081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688664#comment-16688664 ] Apache Spark commented on SPARK-26081: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/23052 > Do not write empty files by text datasources > > > Key: SPARK-26081 > URL: https://issues.apache.org/jira/browse/SPARK-26081 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Text-based datasources like CSV, JSON and Text produce empty files for empty > partitions. This introduces additional overhead when opening and reading > such files back. In the current implementation of OutputWriter, the output stream > is created eagerly even if no records are written to it. So, its creation > can be postponed until the first write. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688653#comment-16688653 ] Apache Spark commented on SPARK-23128: -- User 'justinuang' has created a pull request for this issue: https://github.com/apache/spark/pull/23051 > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Priority: Major > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API was added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An ExchangeCoordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds the ExchangeCoordinator while we are adding Exchanges. However, there are > some limitations. First, it may cause additional shuffles that can decrease > performance. We can see this in the EnsureRequirements rule when it adds > the ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. E.g. for > a three-table join in a single stage, the same ExchangeCoordinator should be used > in all three Exchanges, but currently two separate ExchangeCoordinators will be > added. Thirdly, with the current framework it is not easy to flexibly implement other > features in adaptive execution, like changing the execution plan and > handling skewed joins at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address these limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
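The runtime reducer-number change described above boils down to merging small post-shuffle partitions until each merged group reaches a target size. A rough, hedged illustration of that idea (this is not Spark's actual ExchangeCoordinator code, just the core partition-coalescing logic):

```python
# Hedged sketch: merge adjacent post-shuffle partitions until each merged
# group reaches a target byte size, mirroring the idea behind determining
# the number of post-shuffle partitions at runtime.
def coalesce_partitions(bytes_per_partition, target_bytes):
    groups, current, current_size = [], [], 0
    for i, size in enumerate(bytes_per_partition):
        current.append(i)            # partition i joins the current group
        current_size += size
        if current_size >= target_bytes:
            groups.append(current)   # group is big enough; start a new one
            current, current_size = [], 0
    if current:                      # leftover small tail becomes its own group
        groups.append(current)
    return groups

# e.g. five small map outputs merged into two reducer partitions
print(coalesce_partitions([10, 20, 30, 40, 50], 60))  # [[0, 1, 2], [3, 4]]
```

Because the merge decision needs the sizes of all shuffle dependencies of a stage, it is easiest to make with the global picture the description says the per-Exchange coordinators lack.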
[jira] [Created] (SPARK-26081) Do not write empty files by text datasources
Maxim Gekk created SPARK-26081: -- Summary: Do not write empty files by text datasources Key: SPARK-26081 URL: https://issues.apache.org/jira/browse/SPARK-26081 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Text-based datasources like CSV, JSON and Text produce empty files for empty partitions. This introduces additional overhead when opening and reading such files back. In the current implementation of OutputWriter, the output stream is created eagerly even if no records are written to it. So, its creation can be postponed until the first write. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
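The fix proposed above is to defer opening the output stream until the first record arrives, so an empty partition never touches the filesystem. A hedged sketch of that pattern (illustrative only; `LazyWriter` is a hypothetical class, not Spark's OutputWriter):

```python
# Hedged sketch of the proposed fix: a writer that opens its underlying file
# only on the first write, so closing an untouched writer leaves no file behind.
class LazyWriter:
    def __init__(self, path: str):
        self.path = path
        self._stream = None              # created lazily, not in the constructor

    def write(self, record: str) -> None:
        if self._stream is None:         # first record: open the file now
            self._stream = open(self.path, "w")
        self._stream.write(record + "\n")

    def close(self) -> None:
        if self._stream is not None:     # nothing to close for empty partitions
            self._stream.close()
```

An empty partition then produces no `part-*` file at all, instead of a zero-byte file that readers must still open and skip.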
[jira] [Created] (SPARK-26080) Unable to run worker.py on Windows
Hayden Jeune created SPARK-26080: Summary: Unable to run worker.py on Windows Key: SPARK-26080 URL: https://issues.apache.org/jira/browse/SPARK-26080 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.0 Environment: Windows 10 Education 64 bit Reporter: Hayden Jeune Use of the resource module in Python means worker.py cannot run on a Windows system. This package is only available in Unix-based environments. [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] {code:python} textFile = sc.textFile("README.md") textFile.first() {code} When the above commands are run, I receive the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. I do not really know enough about what I'm doing to fix this myself. Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
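A common pattern for Unix-only stdlib modules like `resource` is a guarded import with a graceful fallback. A hedged sketch of that idea (not necessarily how Spark ultimately fixed the issue; `get_memory_limit` is a hypothetical helper for illustration):

```python
# Hedged sketch: import `resource` only where it exists (Unix), and degrade
# gracefully on Windows instead of crashing with ModuleNotFoundError.
try:
    import resource          # Unix-only: process resource limits
    HAS_RESOURCE = True
except ImportError:          # e.g. on Windows
    resource = None
    HAS_RESOURCE = False

def get_memory_limit():
    """Return the soft RLIMIT_AS value, or None where `resource` is unavailable."""
    if not HAS_RESOURCE:
        return None          # caller treats None as "no limit information"
    soft, _hard = resource.getrlimit(resource.RLIMIT_AS)
    return soft
```

Callers then branch on the sentinel (or the flag) rather than assuming the module is present.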
[jira] [Assigned] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite
[ https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26079: Assignee: Apache Spark > Flaky test: StreamingQueryListenersConfSuite > > > Key: SPARK-26079 > URL: https://issues.apache.org/jira/browse/SPARK-26079 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark >Priority: Minor > > We've had this test fail a few times in our builds. > {noformat} > org.scalatest.exceptions.TestFailedException: null equaled null > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > {noformat} > You can reproduce it reliably by adding a sleep in the test listener. Fix > coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688564#comment-16688564 ] Stavros Kontopoulos edited comment on SPARK-14220 at 11/15/18 7:45 PM: --- [~SeanShubin] That one was fixed here: [https://jira.apache.org/jira/browse/SPARK-22128] if not mistaken [~srowen] knows more. was (Author: skonto): That one was fixed here: [https://jira.apache.org/jira/browse/SPARK-22128] if not mistaken [~srowen] knows more. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688564#comment-16688564 ] Stavros Kontopoulos commented on SPARK-14220: - That one was fixed here: [https://jira.apache.org/jira/browse/SPARK-22128], if I'm not mistaken; [~srowen] knows more. > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Sean Owen >Priority: Blocker > Labels: release-notes > Fix For: 2.4.0 > > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite
[ https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26079: Assignee: (was: Apache Spark) > Flaky test: StreamingQueryListenersConfSuite > > > Key: SPARK-26079 > URL: https://issues.apache.org/jira/browse/SPARK-26079 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We've had this test fail a few times in our builds. > {noformat} > org.scalatest.exceptions.TestFailedException: null equaled null > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > {noformat} > You can reproduce it reliably by adding a sleep in the test listener. Fix > coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite
[ https://issues.apache.org/jira/browse/SPARK-26079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16688561#comment-16688561 ] Apache Spark commented on SPARK-26079: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/23050 > Flaky test: StreamingQueryListenersConfSuite > > > Key: SPARK-26079 > URL: https://issues.apache.org/jira/browse/SPARK-26079 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Minor > > We've had this test fail a few times in our builds. > {noformat} > org.scalatest.exceptions.TestFailedException: null equaled null > at > org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45) > at > org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > {noformat} > You can reproduce it reliably by adding a sleep in the test listener. Fix > coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26079) Flaky test: StreamingQueryListenersConfSuite
Marcelo Vanzin created SPARK-26079: -- Summary: Flaky test: StreamingQueryListenersConfSuite Key: SPARK-26079 URL: https://issues.apache.org/jira/browse/SPARK-26079 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.4.0 Reporter: Marcelo Vanzin We've had this test fail a few times in our builds. {noformat} org.scalatest.exceptions.TestFailedException: null equaled null at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:528) at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:501) at org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:45) at org.apache.spark.sql.streaming.StreamingQueryListenersConfSuite$$anonfun$1.apply(StreamingQueryListenersConfSuite.scala:38) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) at org.scalatest.Transformer.apply(Transformer.scala:22) at org.scalatest.Transformer.apply(Transformer.scala:20) at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) {noformat} You can reproduce it reliably by adding a sleep in the test listener. Fix coming up. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
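The failure mode above ("null equaled null") is the classic race of asserting on state that an asynchronous listener has not populated yet; the sleep mentioned in the report simply widens the window. A hedged Python illustration of the race and the usual fix, polling with a deadline (an "eventually" helper; this mimics the pattern, not the actual Scala test):

```python
# Hedged sketch: asserting immediately after triggering an async listener is
# flaky; polling with a timeout makes the check reliable.
import threading
import time

state = {"event": None}

def listener():
    time.sleep(0.05)              # simulates listener scheduling delay
    state["event"] = "started"

threading.Thread(target=listener).start()

def eventually(predicate, timeout=2.0, interval=0.01):
    """Poll `predicate` until it is true or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Flaky version would be: assert state["event"] == "started"  (may still be None)
# Reliable version:
assert eventually(lambda: state["event"] == "started")
```

The fix referenced in the report presumably replaces the immediate assertion with this kind of wait.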