[jira] [Commented] (SPARK-22192) An RDD of nested POJO objects cannot be converted into a DataFrame using SQLContext.createDataFrame API

2017-10-03 Thread Asif Hussain Shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190852#comment-16190852
 ] 

Asif Hussain Shahid commented on SPARK-22192:
-

Not at all. I have added a bug test in org.apache.spark.sql.SQLContextSuite in 
the branch SPARK-22192.

> An RDD of nested POJO objects cannot be converted into a DataFrame using 
> SQLContext.createDataFrame API
> ---
>
> Key: SPARK-22192
> URL: https://issues.apache.org/jira/browse/SPARK-22192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Independent of OS / platform
>Reporter: Asif Hussain Shahid
>Priority: Minor
>
> If an RDD contains nested POJO objects, then SQLContext.createDataFrame(RDD, 
> Class) api only handles the top level POJO object. It throws ScalaMatchError 
> exception when handling the nested POJO object as the code does not 
> recursively handle the nested POJOs.
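
For context, a minimal reproducer sketch of the reported behavior, assuming JavaBean-style 
classes (the class and field names here are hypothetical, not taken from the report):

{code}
import scala.beans.BeanProperty
import org.apache.spark.sql.SparkSession

// Hypothetical nested bean classes: Person contains an Address.
class Address(@BeanProperty var city: String) extends Serializable {
  def this() = this(null)
}
class Person(@BeanProperty var name: String,
             @BeanProperty var address: Address) extends Serializable {
  def this() = this(null, null)
}

object NestedPojoRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-22192 repro").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(new Person("alice", new Address("paris"))))
    // Expected: a DataFrame with a nested struct column for "address";
    // reported: a match error while inferring the schema of the nested bean.
    spark.createDataFrame(rdd, classOf[Person]).show()
    spark.stop()
  }
}
{code}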






[jira] [Commented] (SPARK-22192) An RDD of nested POJO objects cannot be converted into a DataFrame using SQLContext.createDataFrame API

2017-10-03 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190838#comment-16190838
 ] 

Hyukjin Kwon commented on SPARK-22192:
--

Would you mind if I ask for a reproducer?

> An RDD of nested POJO objects cannot be converted into a DataFrame using 
> SQLContext.createDataFrame API
> ---
>
> Key: SPARK-22192
> URL: https://issues.apache.org/jira/browse/SPARK-22192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: Independent of OS / platform
>Reporter: Asif Hussain Shahid
>Priority: Minor
>
> If an RDD contains nested POJO objects, then SQLContext.createDataFrame(RDD, 
> Class) api only handles the top level POJO object. It throws ScalaMatchError 
> exception when handling the nested POJO object as the code does not 
> recursively handle the nested POJOs.






[jira] [Resolved] (SPARK-22136) Implement stream-stream outer joins in append mode

2017-10-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-22136.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 19327
[https://github.com/apache/spark/pull/19327]

> Implement stream-stream outer joins in append mode
> --
>
> Key: SPARK-22136
> URL: https://issues.apache.org/jira/browse/SPARK-22136
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
> Fix For: 3.0.0
>
>
> Followup to inner join subtask. We can implement outer joins by generating 
> null rows when old state gets cleaned up, given that the join has watermarks 
> allowing that cleanup to happen.
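
For illustration, a minimal usage sketch of the feature (not its implementation), using the 
rate source and hypothetical column names; the watermarks and the time-range join condition 
are what allow old state to be cleaned up and null rows to be emitted:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("outer-join-sketch").getOrCreate()

val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS impressionId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 seconds")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 seconds")

// Left outer: impressions with no matching click are emitted with null click
// columns once the watermark guarantees no matching click can still arrive.
val joined = impressions.join(
  clicks,
  expr("clickId = impressionId AND " +
       "clickTime BETWEEN impressionTime AND impressionTime + interval 30 seconds"),
  "leftOuter")

joined.writeStream.format("console").start()
{code}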






[jira] [Updated] (SPARK-21951) Unable to add the new column and writing into the Hive using spark

2017-10-03 Thread jalendhar Baddam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jalendhar Baddam updated SPARK-21951:
-
Issue Type: Question  (was: Bug)

> Unable to add the new column and writing into the Hive using spark
> --
>
> Key: SPARK-21951
> URL: https://issues.apache.org/jira/browse/SPARK-21951
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> I am adding a new column to an existing Dataset and am unable to write it into 
> Hive using Spark.
> Ex: Dataset<Row> ds = spark.sql("select * from Table");
>  ds = ds.withColumn("newColumn", newColumnvalues);
> ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); 
> // Here I am getting the exception
> I am loading the table from Hive using Spark, adding the new column to 
> that Dataset, and writing the same table back into Hive with the "overwrite" 
> option.






[jira] [Updated] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java

2017-10-03 Thread jalendhar Baddam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jalendhar Baddam updated SPARK-21952:
-
Issue Type: Question  (was: Bug)

> Unable to load the csv file into Dataset  using Spark with java
> ---
>
> Key: SPARK-21952
> URL: https://issues.apache.org/jira/browse/SPARK-21952
> Project: Spark
>  Issue Type: Question
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> Hi,
> I am trying to load a CSV file using Spark with Java. The CSV file contains 
> one row that has two embedded line endings. I am attaching the CSV file and 
> placing the sample CSV file content here.
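
For reference, a hedged sketch of one way to read a CSV whose records contain embedded line 
breaks, assuming the breaks sit inside quoted fields; it uses the multiLine option of the 
built-in CSV reader, available from Spark 2.2 onward (this report targets 2.1.1, where the 
option does not exist), and a placeholder path:

{code}
val df = spark.read
  .option("header", "true")      // assumption: the file has a header row
  .option("multiLine", "true")   // let quoted fields span multiple lines
  .csv("/path/to/sample.csv")    // placeholder path
df.show(truncate = false)
{code}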






[jira] [Resolved] (SPARK-22171) Describe Table Extended Failed when Table Owner is Empty

2017-10-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22171.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Describe Table Extended Failed when Table Owner is Empty
> 
>
> Key: SPARK-22171
> URL: https://issues.apache.org/jira/browse/SPARK-22171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> Users could hit a `java.lang.NullPointerException` when a table was created by 
> Hive and the table owner retrieved from the Hive metastore is `null`. 
> `DESC EXTENDED` then failed with the error:
> {noformat}
> SQLExecutionException: java.lang.NullPointerException at 
> scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at 
> scala.collection.immutable.StringOps.length(StringOps.scala:47) at 
> scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
>  at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at 
> scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at 
> scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at 
> org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66)
>  at 
> {noformat}
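
For illustration, a hedged sketch of the null-safe pattern that avoids this kind of NPE 
(not necessarily the fix that was merged for this ticket):

{code}
// The owner string returned by the Hive metastore can be null in this scenario;
// calling .nonEmpty on it directly throws a NullPointerException.
val owner: String = null
val displayOwner: Option[String] = Option(owner).filter(_.nonEmpty)  // None, no NPE
{code}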






[jira] [Commented] (SPARK-20557) JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE

2017-10-03 Thread Dan Stine (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190706#comment-16190706
 ] 

Dan Stine commented on SPARK-20557:
---

[~JannikArndt] [~smilegator] Can you help me understand the status of this 
issue? I see two PRs both merged to master. But the ticket is unresolved. Is 
there more to do? If not, in what version of Spark will this land? Thanks in 
advance.

> JdbcUtils doesn't support java.sql.Types.TIMESTAMP_WITH_TIMEZONE
> 
>
> Key: SPARK-20557
> URL: https://issues.apache.org/jira/browse/SPARK-20557
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.3.0
>Reporter: Jannik Arndt
>  Labels: easyfix, jdbc, oracle, sql, timestamp
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Reading from an Oracle DB table with a column of type TIMESTAMP WITH TIME 
> ZONE via jdbc ({{spark.sqlContext.read.format("jdbc").option(...).load()}}) 
> results in an error:
> {{Unsupported type -101}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$getCatalystType(JdbcUtils.scala:209)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> {{org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$5.apply(JdbcUtils.scala:246)}}
> That is because the type 
> {{[java.sql.Types.TIMESTAMP_WITH_TIMEZONE|https://docs.oracle.com/javase/8/docs/api/java/sql/Types.html#TIMESTAMP_WITH_TIMEZONE]}}
>  (in Java since 1.8) is missing in 
> {{[JdbcUtils.scala|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L225]}}
>  
> This is similar to SPARK-7039.
> I created a pull request with a fix.
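
For illustration, a hedged sketch of the kind of mapping that is missing (not the exact 
patch in the pull request): recognize the JDBC timestamp-with-timezone type and map it to 
Catalyst's TimestampType.

{code}
import java.sql.Types
import org.apache.spark.sql.types.{DataType, TimestampType}

// Hypothetical helper mirroring what an extra case in JdbcUtils.getCatalystType would do.
// java.sql.Types.TIMESTAMP_WITH_TIMEZONE exists since Java 8.
def timestampWithTimezoneToCatalyst(sqlType: Int): Option[DataType] = sqlType match {
  case Types.TIMESTAMP_WITH_TIMEZONE => Some(TimestampType)
  case _ => None
}
{code}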






[jira] [Created] (SPARK-22195) Add cosine similarity to org.apache.spark.ml.linalg.Vectors

2017-10-03 Thread yuhao yang (JIRA)
yuhao yang created SPARK-22195:
--

 Summary: Add cosine similarity to 
org.apache.spark.ml.linalg.Vectors
 Key: SPARK-22195
 URL: https://issues.apache.org/jira/browse/SPARK-22195
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.2.0
Reporter: yuhao yang
Priority: Minor


https://en.wikipedia.org/wiki/Cosine_similarity:
As one of the most important measures of similarity, I have found it quite 
useful in some image and NLP applications, based on personal experience.

I suggest adding a function for cosine similarity to 
org.apache.spark.ml.linalg.Vectors.

Interface:

{code}
def cosineSimilarity(v1: Vector, v2: Vector): Double = ...
def cosineSimilarity(v1: Vector, v2: Vector, norm1: Double, norm2: Double): Double = ...
{code}

Appreciate suggestions and need green light from committers.
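
For illustration, a minimal sketch of how the first overload could be implemented on top of 
the existing public Vectors.norm helper (this is not committed API, and a real implementation 
would exploit sparsity instead of converting to dense arrays):

{code}
import org.apache.spark.ml.linalg.{Vector, Vectors}

def cosineSimilarity(v1: Vector, v2: Vector): Double = {
  require(v1.size == v2.size, "vectors must have the same dimension")
  val a = v1.toArray
  val b = v2.toArray
  var dot = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); i += 1 }
  dot / (Vectors.norm(v1, 2.0) * Vectors.norm(v2, 2.0))
}

// Example: vectors pointing in the same direction give similarity ~1.0.
cosineSimilarity(Vectors.dense(1.0, 2.0), Vectors.dense(2.0, 4.0))
{code}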






[jira] [Commented] (SPARK-22193) SortMergeJoinExec: typo correction

2017-10-03 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190580#comment-16190580
 ] 

Takeshi Yamamuro commented on SPARK-22193:
--

You probably don't need to file a JIRA for trivial fixes.

> SortMergeJoinExec: typo correction
> --
>
> Key: SPARK-22193
> URL: https://issues.apache.org/jira/browse/SPARK-22193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rekha Joshi
>Priority: Trivial
>
> Typo correction in SortMergeJoinExec. Nothing major, but it bothered me while 
> going through the code, hence fixing it.






[jira] [Closed] (SPARK-19426) Add support for custom coalescers on Data

2017-10-03 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro closed SPARK-19426.

Resolution: Later

> Add support for custom coalescers on Data
> -
>
> Key: SPARK-19426
> URL: https://issues.apache.org/jira/browse/SPARK-19426
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marius Van Niekerk
>Priority: Minor
>
> This is a continuation of SPARK-14042 now that the Dataset APIs have 
> stabilized in Spark 2+.
> Provide the same PartitionCoalescer support that exists in the RDD API.






[jira] [Commented] (SPARK-19426) Add support for custom coalescers on Data

2017-10-03 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190578#comment-16190578
 ] 

Takeshi Yamamuro commented on SPARK-19426:
--

I'll close this for now because the priority is not very high. We might revisit 
it later. Thanks.

> Add support for custom coalescers on Data
> -
>
> Key: SPARK-19426
> URL: https://issues.apache.org/jira/browse/SPARK-19426
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marius Van Niekerk
>Priority: Minor
>
> This is a continuation of SPARK-14042 now that the Dataset APIs have 
> stabilized in Spark 2+.
> Provide the same PartitionCoalescer support that exists in the RDD API.






[jira] [Resolved] (SPARK-20466) HadoopRDD#addLocalConfiguration throws NPE

2017-10-03 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-20466.

   Resolution: Fixed
 Assignee: Sahil Takiar
Fix Version/s: 2.1.3
   2.3.0
   2.2.1

> HadoopRDD#addLocalConfiguration throws NPE
> --
>
> Key: SPARK-20466
> URL: https://issues.apache.org/jira/browse/SPARK-20466
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.2
>Reporter: liyunzhang_intel
>Assignee: Sahil Takiar
>Priority: Minor
> Fix For: 2.2.1, 2.3.0, 2.1.3
>
> Attachments: NPE_log
>
>
> In Spark 2.0.2, it throws an NPE:
> {code}
>   17/04/23 08:19:55 ERROR executor.Executor: Exception in task 439.0 in stage 
> 16.0 (TID 986)$ 
> java.lang.NullPointerException$
> ^Iat 
> org.apache.spark.rdd.HadoopRDD$.addLocalConfiguration(HadoopRDD.scala:373)$
> ^Iat org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:243)$
> ^Iat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)$
> ^Iat org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)$
> ^Iat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)$
> ^Iat org.apache.spark.rdd.RDD.iterator(RDD.scala:283)$
> ^Iat org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)$
> ^Iat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)$
> ^Iat org.apache.spark.rdd.RDD.iterator(RDD.scala:283)$
> ^Iat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)$
> ^Iat org.apache.spark.scheduler.Task.run(Task.scala:86)$
> ^Iat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)$
> ^Iat 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)$
> ^Iat 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)$
> ^Iat java.lang.Thread.run(Thread.java:745)$
> {code}
> Suggestion: add a null check to avoid the NPE:
> {code}
> /** Add Hadoop configuration specific to a single partition and attempt. */
> def addLocalConfiguration(jobTrackerId: String, jobId: Int, splitId: Int,
>     attemptId: Int, conf: JobConf) {
>   val jobID = new JobID(jobTrackerId, jobId)
>   val taId = new TaskAttemptID(new TaskID(jobID, TaskType.MAP, splitId), attemptId)
>   if (conf != null) {
>     conf.set("mapred.tip.id", taId.getTaskID.toString)
>     conf.set("mapred.task.id", taId.toString)
>     conf.setBoolean("mapred.task.is.map", true)
>     conf.setInt("mapred.task.partition", splitId)
>     conf.set("mapred.job.id", jobID.toString)
>   }
> }
> {code}






[jira] [Created] (SPARK-22194) Allow namespacing of configs in spark.internal.config

2017-10-03 Thread Gregory Owen (JIRA)
Gregory Owen created SPARK-22194:


 Summary: Allow namespacing of configs in spark.internal.config
 Key: SPARK-22194
 URL: https://issues.apache.org/jira/browse/SPARK-22194
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Gregory Owen
Priority: Minor


spark.internal.config is in danger of becoming an unmanageable dumping ground 
as we add more and more configs from every part of the project. It would be 
nice to have a namespaced/federated config system where the configs for 
different components can live in different files.






[jira] [Assigned] (SPARK-22193) SortMergeJoinExec: typo correction

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22193:


Assignee: Apache Spark

> SortMergeJoinExec: typo correction
> --
>
> Key: SPARK-22193
> URL: https://issues.apache.org/jira/browse/SPARK-22193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rekha Joshi
>Assignee: Apache Spark
>Priority: Trivial
>
> Typo correction in SortMergeJoinExec. Nothing major, but it bothered me while 
> going through the code, hence fixing it.






[jira] [Commented] (SPARK-22193) SortMergeJoinExec: typo correction

2017-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190365#comment-16190365
 ] 

Apache Spark commented on SPARK-22193:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/19422

> SortMergeJoinExec: typo correction
> --
>
> Key: SPARK-22193
> URL: https://issues.apache.org/jira/browse/SPARK-22193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rekha Joshi
>Priority: Trivial
>
> Typo correction in SortMergeJoinExec. Nothing major, but it bothered me while 
> going through the code, hence fixing it.






[jira] [Assigned] (SPARK-22193) SortMergeJoinExec: typo correction

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22193:


Assignee: (was: Apache Spark)

> SortMergeJoinExec: typo correction
> --
>
> Key: SPARK-22193
> URL: https://issues.apache.org/jira/browse/SPARK-22193
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Rekha Joshi
>Priority: Trivial
>
> Typo correction in SortMergeJoinExec. Nothing major, but it bothered me while 
> going through the code, hence fixing it.






[jira] [Created] (SPARK-22193) SortMergeJoinExec: typo correction

2017-10-03 Thread Rekha Joshi (JIRA)
Rekha Joshi created SPARK-22193:
---

 Summary: SortMergeJoinExec: typo correction
 Key: SPARK-22193
 URL: https://issues.apache.org/jira/browse/SPARK-22193
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Rekha Joshi
Priority: Trivial


Typo correction in SortMergeJoinExec. Nothing major, but it bothered me while 
going through the code, hence fixing it.






[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-10-03 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190239#comment-16190239
 ] 

yuhao yang commented on SPARK-21866:


My two cents,

1. In most scenarios, deep learning applications use rescaled/cropped images 
(typically 256, 224 or smaller). I would add an extra parameter "smallSideSize" 
to the readImages method; it is more convenient for users, and we then don't 
need to cache the image at its original size (which could be 100 times larger 
than the scaled image). 

2. I am not sure about the reason for including path info in the image data. 
Based on my experience, path info serves better as a separate column in the 
DataFrame.

3. After some augmentation and normalization, the image data will be floating 
point numbers rather than bytes. That is fine if the current format is only 
for reading the image data, but not if it is meant as the standard image 
feature exchange format in Spark.

4. I don't see the parameter "recursive" as necessary. The existing wildcard 
matching provides more functionality. 

Part of the image pre-processing code I used (a little stale) is available at 
https://github.com/hhbyyh/SparkDL, just for reference.



> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general 

[jira] [Resolved] (SPARK-21644) LocalLimit.maxRows is defined incorrectly

2017-10-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21644.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> LocalLimit.maxRows is defined incorrectly
> -
>
> Key: SPARK-21644
> URL: https://issues.apache.org/jira/browse/SPARK-21644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.3.0
>
>
> {code}
> case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends 
> UnaryNode {
>   override def output: Seq[Attribute] = child.output
>   override def maxRows: Option[Long] = {
> limitExpr match {
>   case IntegerLiteral(limit) => Some(limit)
>   case _ => None
> }
>   }
> }
> {code}
> This is simply wrong, since LocalLimit is only about partition level limits.
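
To make the problem concrete, a small worked illustration (the numbers are hypothetical): a 
LocalLimit only bounds each partition, so the plan as a whole can still return far more rows 
than the limit.

{code}
// With a per-partition limit of 10 and 4 partitions, up to 40 rows can survive
// LocalLimit, so advertising maxRows = Some(10) is not a valid global bound.
val limit = 10
val numPartitions = 4
val worstCaseRows = limit.toLong * numPartitions  // 40, not 10
{code}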






[jira] [Created] (SPARK-22192) An RDD of nested POJO objects cannot be converted into a DataFrame using SQLContext.createDataFrame API

2017-10-03 Thread Asif Hussain Shahid (JIRA)
Asif Hussain Shahid created SPARK-22192:
---

 Summary: An RDD of nested POJO objects cannot be converted into a 
DataFrame using SQLContext.createDataFrame API
 Key: SPARK-22192
 URL: https://issues.apache.org/jira/browse/SPARK-22192
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: Independent of OS / platform
Reporter: Asif Hussain Shahid
Priority: Minor


If an RDD contains nested POJO objects, then SQLContext.createDataFrame(RDD, 
Class) api only handles the top level POJO object. It throws ScalaMatchError 
exception when handling the nested POJO object as the code does not recursively 
handle the nested POJOs.






[jira] [Resolved] (SPARK-22158) convertMetastore should not ignore table properties

2017-10-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22158.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.3.0
   2.2.1

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-22158
> URL: https://issues.apache.org/jira/browse/SPARK-22158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.2.1, 2.3.0
>
>
> From the beginning, convertMetastoreOrc has ignored table properties and used 
> an empty map instead. The same is true of convertMetastoreParquet.
> {code}
> val options = Map[String, String]()
> {code}
> - SPARK-14070: 
> https://github.com/apache/spark/pull/11891/files#diff-ee66e11b56c21364760a5ed2b783f863R650
> - master: 
> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L197






[jira] [Commented] (SPARK-19984) ERROR codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2017-10-03 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190050#comment-16190050
 ] 

Kazuaki Ishizaki commented on SPARK-19984:
--

[~JohnSteidley] Thank you for providing valuable information.
I wrote a small program derived from your example that has the following 
physical plan. I think this physical plan is almost the same as the one you 
provided; the only difference appears to be whether Parquet is used or not.
However, I cannot reproduce the issue with this small program using Spark 2.1 
or the master branch. Tomorrow, I will try to use Parquet in the small program.

{code}
== Physical Plan ==
*HashAggregate(keys=[], functions=[count(A#52)], output=[count(A)#63L])
+- *HashAggregate(keys=[], functions=[partial_count(A#52)], output=[count#68L])
   +- *Project [A#52]
  +- *SortMergeJoin [A#52], [A#43], Inner
 :- *Sort [A#52 ASC NULLS FIRST], false, 0
 :  +- Exchange hashpartitioning(A#52, 1)
 : +- *Project [value#50 AS A#52]
 :+- *Filter isnotnull(value#50)
 :   +- *SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
java.lang.String, true], true) AS value#50]
 :  +- Scan ExternalRDDScan[obj#49]
 +- *Sort [A#43 ASC NULLS FIRST], false, 0
+- *Project [id#39 AS A#43]
   +- *Filter isnotnull(id#39)
  +- *GlobalLimit 2
 +- Exchange SinglePartition
+- *LocalLimit 2
   +- *Project [value#37 AS id#39]
  +- *SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
java.lang.String, true], true) AS value#37]
 +- Scan ExternalRDDScan[obj#36]
{code}

> ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-19984
> URL: https://issues.apache.org/jira/browse/SPARK-19984
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.1.0
>Reporter: Andrey Yakovenko
> Attachments: after_adding_count.txt, before_adding_count.txt
>
>
> I have hit this error a few times in my local Hadoop 2.7.3 + Spark 2.1.0 
> environment. It is not a permanent error; the next time I run the job it may 
> disappear. Unfortunately, I don't know how to reproduce the issue. As you can 
> see from the log, my logic is pretty complicated.
> Here is a part of the log I've got (container_1489514660953_0015_01_01):
> {code}
> 17/03/16 11:07:04 ERROR codegen.CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 151, Column 29: A method named "compare" is not declared in any enclosing 
> class nor any supertype, nor through a static import
> /* 001 */ public Object generate(Object[] references) {
> /* 002 */   return new GeneratedIterator(references);
> /* 003 */ }
> /* 004 */
> /* 005 */ final class GeneratedIterator extends 
> org.apache.spark.sql.execution.BufferedRowIterator {
> /* 006 */   private Object[] references;
> /* 007 */   private scala.collection.Iterator[] inputs;
> /* 008 */   private boolean agg_initAgg;
> /* 009 */   private boolean agg_bufIsNull;
> /* 010 */   private long agg_bufValue;
> /* 011 */   private boolean agg_initAgg1;
> /* 012 */   private boolean agg_bufIsNull1;
> /* 013 */   private long agg_bufValue1;
> /* 014 */   private scala.collection.Iterator smj_leftInput;
> /* 015 */   private scala.collection.Iterator smj_rightInput;
> /* 016 */   private InternalRow smj_leftRow;
> /* 017 */   private InternalRow smj_rightRow;
> /* 018 */   private UTF8String smj_value2;
> /* 019 */   private java.util.ArrayList smj_matches;
> /* 020 */   private UTF8String smj_value3;
> /* 021 */   private UTF8String smj_value4;
> /* 022 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> smj_numOutputRows;
> /* 023 */   private UnsafeRow smj_result;
> /* 024 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder smj_holder;
> /* 025 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> smj_rowWriter;
> /* 026 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_numOutputRows;
> /* 027 */   private org.apache.spark.sql.execution.metric.SQLMetric 
> agg_aggTime;
> /* 028 */   private UnsafeRow agg_result;
> /* 029 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder agg_holder;
> /* 030 */   private 
> org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
> agg_rowWriter;
> /* 031 */   private 

[jira] [Commented] (SPARK-22191) Add hive serde example with serde properties

2017-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16190047#comment-16190047
 ] 

Apache Spark commented on SPARK-22191:
--

User 'crlalam' has created a pull request for this issue:
https://github.com/apache/spark/pull/19420

> Add hive serde example with serde properties
> 
>
> Key: SPARK-22191
> URL: https://issues.apache.org/jira/browse/SPARK-22191
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Chinna Rao Lalam
>Priority: Minor
>
> Added an example of specifying a serde with serde properties for Hive tables 
> using OPTIONS.






[jira] [Assigned] (SPARK-22191) Add hive serde example with serde properties

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22191:


Assignee: Apache Spark

> Add hive serde example with serde properties
> 
>
> Key: SPARK-22191
> URL: https://issues.apache.org/jira/browse/SPARK-22191
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Chinna Rao Lalam
>Assignee: Apache Spark
>Priority: Minor
>
> Added an example of specifying a serde with serde properties for Hive tables 
> using OPTIONS.






[jira] [Assigned] (SPARK-22191) Add hive serde example with serde properties

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22191:


Assignee: (was: Apache Spark)

> Add hive serde example with serde properties
> 
>
> Key: SPARK-22191
> URL: https://issues.apache.org/jira/browse/SPARK-22191
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples, SQL
>Affects Versions: 2.2.0
>Reporter: Chinna Rao Lalam
>Priority: Minor
>
> Added an example of specifying a serde with serde properties for Hive tables 
> using OPTIONS.






[jira] [Created] (SPARK-22191) Add hive serde example with serde properties

2017-10-03 Thread Chinna Rao Lalam (JIRA)
Chinna Rao Lalam created SPARK-22191:


 Summary: Add hive serde example with serde properties
 Key: SPARK-22191
 URL: https://issues.apache.org/jira/browse/SPARK-22191
 Project: Spark
  Issue Type: Improvement
  Components: Examples, SQL
Affects Versions: 2.2.0
Reporter: Chinna Rao Lalam
Priority: Minor


Added an example of specifying a serde with serde properties for Hive tables 
using OPTIONS.
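
For illustration, a hedged sketch of the kind of example being added (the table name and 
option values are assumptions, following the documented OPTIONS syntax for Hive-format 
tables):

{code}
// Assumes a SparkSession created with Hive support enabled.
spark.sql(
  """
    |CREATE TABLE hive_serde_example(key INT, value STRING)
    |USING hive
    |OPTIONS(
    |  fileFormat 'textfile',
    |  fieldDelim ','
    |)
  """.stripMargin)
{code}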






[jira] [Assigned] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22184:


Assignee: (was: Apache Spark)

> GraphX fails in case of insufficient memory and checkpoints enabled
> ---
>
> Key: SPARK-22184
> URL: https://issues.apache.org/jira/browse/SPARK-22184
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> GraphX fails with FileNotFoundException in case of insufficient memory when 
> checkpoints are enabled.
> Here is the stacktrace 
> {code}
> Job aborted due to stage failure: Task creation failed: 
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697)
> ...
> {code}
> As GraphX uses cached RDDs intensively, the issue is only reproducible when 
> previously cached and checkpointed Vertex and Edge RDDs are evicted from 
> memory and forced to be read from disk. 
> For testing purposes the following parameters may be set to emulate low 
> memory environment
> {code}
> val sparkConf = new SparkConf()
>   .set("spark.graphx.pregel.checkpointInterval", "2")
>   // set testing memory to evict cached RDDs from it and force
>   // reading checkpointed RDDs from disk
>   .set("spark.testing.reservedMemory", "128")
>   .set("spark.testing.memory", "256")
> {code}
> This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is 
> fixed too.






[jira] [Assigned] (SPARK-22184) GraphX fails in case of insufficient memory and checkpoints enabled

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22184:


Assignee: Apache Spark

> GraphX fails in case of insufficient memory and checkpoints enabled
> ---
>
> Key: SPARK-22184
> URL: https://issues.apache.org/jira/browse/SPARK-22184
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>Assignee: Apache Spark
>
> GraphX fails with FileNotFoundException in case of insufficient memory when 
> checkpoints are enabled.
> Here is the stacktrace 
> {code}
> Job aborted due to stage failure: Task creation failed: 
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
> java.io.FileNotFoundException: File 
> file:/tmp/spark-90119695-a126-47b5-b047-d656fee10c17/9b16e2a9-6c80-45eb-8736-bbb6eb840146/rdd-28/part-0
>  does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:539)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:752)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:529)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:409)
>   at 
> org.apache.spark.rdd.ReliableCheckpointRDD.getPreferredLocations(ReliableCheckpointRDD.scala:89)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$1.apply(RDD.scala:274)
>   at scala.Option.map(Option.scala:146)
>   at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:274)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1697)
> ...
> {code}
> As GraphX uses cached RDDs intensively, the issue is only reproducible when 
> previously cached and checkpointed Vertex and Edge RDDs are evicted from 
> memory and forced to be read from disk. 
> For testing purposes the following parameters may be set to emulate low 
> memory environment
> {code}
> val sparkConf = new SparkConf()
>   .set("spark.graphx.pregel.checkpointInterval", "2")
>   // set testing memory to evict cached RDDs from it and force
>   // reading checkpointed RDDs from disk
>   .set("spark.testing.reservedMemory", "128")
>   .set("spark.testing.memory", "256")
> {code}
> This issue also includes SPARK-22150 and cannot be fixed until SPARK-22150 is 
> fixed too.






[jira] [Commented] (SPARK-22167) Spark Packaging w/R distro issues

2017-10-03 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189954#comment-16189954
 ] 

Felix Cheung commented on SPARK-22167:
--

There are likely 2 stages to this.
More pressing might be the fact that the hadoop-2.6 and hadoop-2.7 release tgzs 
have fairly different content because of how the make-release script is structured.
I will open a new JIRA on this.


> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.






[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/3/17 3:21 PM:


A similar problem is observed while joining 2 Hive tables on partition columns 
in Spark.

*Example*
Table A has 1000 partitions (date partition and hour sub-partition) and Table B 
has 2 partitions. When joining the 2 tables on the partition columns, it does a 
full table scan of Table A, i.e. it uses all 1000 partitions instead of taking 
the 2 matching partitions from Table A and joining them with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
Select the 2 partitions from Table B and then do a lookup on Table A:
step1: 
{noformat} select trans_date, a.trans_hour from table B {noformat}
step2: 
{noformat} select * from tableA where trans_date= and 
a.trans_hour = {noformat}




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: 
{noformat} select trans_date and a.trans_hour from table B {noformat}
step2: 
{noformat} select * from tableA where trans_date= and 
a.trans_hour = {noformat}



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the Hive SQL contains nondeterministic fields, the Spark plan will not 
> push down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The Spark plan will not push down the partition predicate to the 
> HiveTableScan, which ends up scanning all partitions' data from the table.






[jira] [Updated] (SPARK-21549) Spark fails to complete job correctly in case of OutputFormat which do not write into hdfs

2017-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21549:
--
Fix Version/s: (was: 2.2.1)

> Spark fails to complete job correctly in case of OutputFormat which do not 
> write into hdfs
> --
>
> Key: SPARK-21549
> URL: https://issues.apache.org/jira/browse/SPARK-21549
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: spark 2.2.0
> scala 2.11
>Reporter: Sergey Zhemzhitsky
>
> Spark fails to complete the job correctly in the case of custom OutputFormat 
> implementations.
> There are OutputFormat implementations which do not need the standard Hadoop 
> property *mapreduce.output.fileoutputformat.outputdir*.
> [But spark reads this property from the 
> configuration|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/SparkHadoopMapReduceWriter.scala#L79]
>  while setting up an OutputCommitter
> {code:javascript}
> val committer = FileCommitProtocol.instantiate(
>   className = classOf[HadoopMapReduceCommitProtocol].getName,
>   jobId = stageId.toString,
>   outputPath = conf.value.get("mapreduce.output.fileoutputformat.outputdir"),
>   isAppend = false).asInstanceOf[HadoopMapReduceCommitProtocol]
> committer.setupJob(jobContext)
> {code}
> ... and then uses this property later on while [committing the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L132],
>  [aborting the 
> job|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L141],
>  [creating task's temp 
> path|https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala#L95]
> In those cases, when the job completes, the following exception is thrown:
> {code}
> Can not create a Path from a null string
> java.lang.IllegalArgumentException: Can not create a Path from a null string
>   at org.apache.hadoop.fs.Path.checkPathArg(Path.java:123)
>   at org.apache.hadoop.fs.Path.(Path.java:135)
>   at org.apache.hadoop.fs.Path.(Path.java:89)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.absPathStagingDir(HadoopMapReduceCommitProtocol.scala:58)
>   at 
> org.apache.spark.internal.io.HadoopMapReduceCommitProtocol.abortJob(HadoopMapReduceCommitProtocol.scala:141)
>   at 
> org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:106)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1085)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at 
> org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1084)
>   ...
> {code}
> So it seems that all the jobs which use OutputFormats which don't write data 
> into HDFS-compatible file systems are broken.






[jira] [Resolved] (SPARK-22189) Number of jobs created while querying partitioned table in hive using spark

2017-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22189.
---
Resolution: Invalid

Questions should go to the mailing list, please.

> Number of jobs created while querying partitioned table in hive using spark
> ---
>
> Key: SPARK-22189
> URL: https://issues.apache.org/jira/browse/SPARK-22189
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Astha Arya
>
> I am using Spark SQL 
> Spark version - 1.6.0
> Hive 1.1.0-cdh5.9.0
> When I run hiveContext.sql, it creates 2 additional jobs in my case, i.e. 3 
> jobs in total, when querying a partitioned Hive table. When I run the same 
> query on Hive using Spark as the execution engine, it creates only one job. 
> Also, the driver logs show that it lists all the partitions, which most likely 
> shouldn't happen because it slows down my execution. 
> Is this a bug? Is there any way to reduce the number of jobs and also avoid 
> listing all the partitions each time I query the same table?






[jira] [Created] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2017-10-03 Thread Luca Canali (JIRA)
Luca Canali created SPARK-22190:
---

 Summary: Add Spark executor task metrics to Dropwizard metrics
 Key: SPARK-22190
 URL: https://issues.apache.org/jira/browse/SPARK-22190
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Luca Canali
Priority: Minor


I would like to propose exposing Spark executor task metrics via the Dropwizard 
metrics system. I have developed a simple implementation and run a few tests 
using the Graphite sink and Grafana visualization, and this appears to be a good 
source of information for monitoring and troubleshooting the progress of Spark 
jobs. I have attached a screenshot of an example graph generated with Grafana.
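
For context, a hedged sketch of the standard Dropwizard/Graphite sink wiring used for such 
tests (the host is a placeholder; this shows only the existing metrics plumbing, not the 
proposed new executor task metrics):

{code}
import org.apache.spark.SparkConf

// Metrics settings can also live in conf/metrics.properties; via SparkConf they
// use the "spark.metrics.conf." prefix.
val conf = new SparkConf()
  .set("spark.metrics.conf.*.sink.graphite.class",
    "org.apache.spark.metrics.sink.GraphiteSink")
  .set("spark.metrics.conf.*.sink.graphite.host", "graphite.example.com")  // placeholder
  .set("spark.metrics.conf.*.sink.graphite.port", "2003")
  .set("spark.metrics.conf.*.sink.graphite.period", "10")
  .set("spark.metrics.conf.*.sink.graphite.unit", "seconds")
{code}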






[jira] [Updated] (SPARK-22190) Add Spark executor task metrics to Dropwizard metrics

2017-10-03 Thread Luca Canali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali updated SPARK-22190:

Attachment: SparkTaskMetrics_Grafana_example.PNG

!SparkTaskMetrics_Grafana_example.PNG|thumbnail!

> Add Spark executor task metrics to Dropwizard metrics
> -
>
> Key: SPARK-22190
> URL: https://issues.apache.org/jira/browse/SPARK-22190
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Luca Canali
>Priority: Minor
> Attachments: SparkTaskMetrics_Grafana_example.PNG
>
>
> I would like to propose exposing Spark executor task metrics via the Dropwizard 
> metrics system. I have developed a simple implementation and run a few tests 
> using the Graphite sink and Grafana visualization, and this appears to be a good 
> source of information for monitoring and troubleshooting the progress of Spark 
> jobs. I have attached a screenshot of an example graph generated with Grafana.






[jira] [Created] (SPARK-22189) Number of jobs created while querying partitioned table in hive using spark

2017-10-03 Thread Astha Arya (JIRA)
Astha Arya created SPARK-22189:
--

 Summary: Number of jobs created while querying partitioned table 
in hive using spark
 Key: SPARK-22189
 URL: https://issues.apache.org/jira/browse/SPARK-22189
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 1.6.0
Reporter: Astha Arya


I am using Spark SQL 
Spark version - 1.6.0
Hive 1.1.0-cdh5.9.0
When I run hiveContext.sql, it creates 2 additional jobs in my case, i.e. 3 jobs 
in total, when querying a partitioned Hive table. When I run the same query on 
Hive using Spark as the execution engine, it creates only one job. 
Also, the driver logs show that it lists all the partitions, which most likely 
shouldn't happen because it slows down my execution. 
Is this a bug? Is there any way to reduce the number of jobs and also avoid 
listing all the partitions each time I query the same table?







[jira] [Commented] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189700#comment-16189700
 ] 

Apache Spark commented on SPARK-22188:
--

User 'krishna-pandey' has created a pull request for this issue:
https://github.com/apache/spark/pull/19419

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Krishna Pandey
>Priority: Minor
>  Labels: security
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website however the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22188:


Assignee: Apache Spark

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Krishna Pandey
>Assignee: Apache Spark
>Priority: Minor
>  Labels: security
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website however the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-03 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22188:


Assignee: (was: Apache Spark)

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Krishna Pandey
>Priority: Minor
>  Labels: security
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website however the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-03 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22188:
--
  Shepherd:   (was: Sean Owen)
 Flags:   (was: Important)
  Priority: Minor  (was: Critical)
Issue Type: Improvement  (was: Bug)

Given that Spark UIs are typically internal to corporate networks, I don't think 
this can be considered a significant bug or problem.

> Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack
> ---
>
> Key: SPARK-22188
> URL: https://issues.apache.org/jira/browse/SPARK-22188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Krishna Pandey
>Priority: Minor
>  Labels: security
>
> The HTTP response headers below can be added to improve security.
> The HTTP *Strict-Transport-Security* response header (often abbreviated as 
> HSTS) is a security feature that lets a web site tell browsers that it should 
> only be communicated with using HTTPS, instead of using HTTP.
> *Note:* The Strict-Transport-Security header is ignored by the browser when 
> your site is accessed using HTTP; this is because an attacker may intercept 
> HTTP connections and inject the header or remove it. When your site is 
> accessed over HTTPS with no certificate errors, the browser knows your site 
> is HTTPS capable and will honor the Strict-Transport-Security header.
> *An example scenario*
> You log into a free WiFi access point at an airport and start surfing the 
> web, visiting your online banking service to check your balance and pay a 
> couple of bills. Unfortunately, the access point you're using is actually a 
> hacker's laptop, and they're intercepting your original HTTP request and 
> redirecting you to a clone of your bank's site instead of the real thing. Now 
> your private data is exposed to the hacker.
> Strict Transport Security resolves this problem; as long as you've accessed 
> your bank's web site once using HTTPS, and the bank's web site uses Strict 
> Transport Security, your browser will know to automatically use only HTTPS, 
> which prevents hackers from performing this sort of man-in-the-middle attack.
> *Syntax:*
> Strict-Transport-Security: max-age=<expire-time>
> Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
> Strict-Transport-Security: max-age=<expire-time>; preload
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security
> The HTTP *X-XSS-Protection* response header is a feature of Internet 
> Explorer, Chrome and Safari that stops pages from loading when they detect 
> reflected cross-site scripting (XSS) attacks.
> *Syntax:*
> X-XSS-Protection: 0
> X-XSS-Protection: 1
> X-XSS-Protection: 1; mode=block
> X-XSS-Protection: 1; report=<reporting-uri>
> Read more at 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection
> The HTTP *X-Content-Type-Options* response header is used to protect against 
> MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
> allows users to upload content to a website however the user disguises a 
> particular file type as something else. This can give them the opportunity to 
> perform cross-site scripting and compromise the website. Read more at 
> https://www.keycdn.com/support/x-content-type-options/ and 
> https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22188) Add defense against Cross-Site Scripting, MIME-sniffing and MitM attack

2017-10-03 Thread Krishna Pandey (JIRA)
Krishna Pandey created SPARK-22188:
--

 Summary: Add defense against Cross-Site Scripting, MIME-sniffing 
and MitM attack
 Key: SPARK-22188
 URL: https://issues.apache.org/jira/browse/SPARK-22188
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Krishna Pandey
Priority: Critical


The HTTP response headers below can be added to improve security.

The HTTP *Strict-Transport-Security* response header (often abbreviated as 
HSTS) is a security feature that lets a web site tell browsers that it should 
only be communicated with using HTTPS, instead of using HTTP.

*Note:* The Strict-Transport-Security header is ignored by the browser when 
your site is accessed using HTTP; this is because an attacker may intercept 
HTTP connections and inject the header or remove it. When your site is accessed 
over HTTPS with no certificate errors, the browser knows your site is HTTPS 
capable and will honor the Strict-Transport-Security header.

*An example scenario*
You log into a free WiFi access point at an airport and start surfing the web, 
visiting your online banking service to check your balance and pay a couple of 
bills. Unfortunately, the access point you're using is actually a hacker's 
laptop, and they're intercepting your original HTTP request and redirecting you 
to a clone of your bank's site instead of the real thing. Now your private data 
is exposed to the hacker.
Strict Transport Security resolves this problem; as long as you've accessed 
your bank's web site once using HTTPS, and the bank's web site uses Strict 
Transport Security, your browser will know to automatically use only HTTPS, 
which prevents hackers from performing this sort of man-in-the-middle attack.

*Syntax:*
Strict-Transport-Security: max-age=<expire-time>
Strict-Transport-Security: max-age=<expire-time>; includeSubDomains
Strict-Transport-Security: max-age=<expire-time>; preload
Read more at 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Strict-Transport-Security

The HTTP *X-XSS-Protection* response header is a feature of Internet Explorer, 
Chrome and Safari that stops pages from loading when they detect reflected 
cross-site scripting (XSS) attacks.

*Syntax:*
X-XSS-Protection: 0
X-XSS-Protection: 1
X-XSS-Protection: 1; mode=block
X-XSS-Protection: 1; report=<reporting-uri>
Read more at 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-XSS-Protection

The HTTP *X-Content-Type-Options* response header is used to protect against 
MIME sniffing vulnerabilities. These vulnerabilities can occur when a website 
allows users to upload content to a website however the user disguises a 
particular file type as something else. This can give them the opportunity to 
perform cross-site scripting and compromise the website. Read more at 
https://www.keycdn.com/support/x-content-type-options/ and 
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/X-Content-Type-Options
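
For illustration only, a minimal servlet Filter sketch in Scala showing how such 
headers could be attached to responses from a Jetty-based UI; the class name and 
header values are examples, not the actual change proposed for Spark.

{code}
import javax.servlet._
import javax.servlet.http.HttpServletResponse

class SecurityHeadersFilter extends Filter {
  override def init(config: FilterConfig): Unit = {}

  override def doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain): Unit = {
    res match {
      case http: HttpServletResponse =>
        // HSTS is only meaningful when the UI is actually served over HTTPS.
        http.setHeader("Strict-Transport-Security", "max-age=31536000; includeSubDomains")
        http.setHeader("X-XSS-Protection", "1; mode=block")
        http.setHeader("X-Content-Type-Options", "nosniff")
      case _ => // non-HTTP response, nothing to add
    }
    chain.doFilter(req, res)
  }

  override def destroy(): Unit = {}
}
{code}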



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16709) Task with commit failed will retry infinite when speculation set to true

2017-10-03 Thread Artur Sukhenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artur Sukhenko closed SPARK-16709.
--
   Resolution: Duplicate
Fix Version/s: 1.6.2

> Task with commit failed will retry infinite when speculation set to true
> 
>
> Key: SPARK-16709
> URL: https://issues.apache.org/jira/browse/SPARK-16709
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.0
>Reporter: Hong Shen
> Fix For: 1.6.2
>
> Attachments: commit failed.png
>
>
> In our cluster, we set spark.speculation=true, but when a task throws an 
> exception at SparkHadoopMapRedUtil.performCommit(), the task can retry 
> indefinitely.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/mapred/SparkHadoopMapRedUtil.scala#L83



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17885) Spark Streaming deletes checkpointed RDD then tries to load it after restart

2017-10-03 Thread Vishal John (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189483#comment-16189483
 ] 

Vishal John edited comment on SPARK-17885 at 10/3/17 11:27 AM:
---

I can see that the checkpointed folder was explicitly deleted: 
INFO dstream.DStreamCheckpointData: Deleted checkpoint file 
'hdfs://nameservice1/user/my-user/checkpoints/my-application/8c683e77-33b9-42ee-80f7-167abb39c241/rdd-401'

I was looking at the source code of the `cleanup` method in 
`DStreamCheckpointData`. I am curious to know which setting is causing this 
behaviour.

My StreamingContext batch duration is 30 seconds and I haven't provided any 
other time intervals. Do I need to provide any other intervals, such as a 
checkpoint interval?

-

UPDATE: I was able to get around this problem by setting 
"spark.streaming.stopGracefullyOnShutdown" to "true".




was (Author: vishaljohn):
I can see that the checkpointed folder was explicitly deleted - 
INFO dstream.DStreamCheckpointData: Deleted checkpoint file 
'hdfs://nameservice1/user/my-user/checkpoints/my-application/8c683e77-33b9-42ee-80f7-167abb39c241/rdd-401

I was looking at the source code of `cleanup` method in 
`DStreamCheckpointData`. I am curious to know what setting is causing this 
behaviour.

My StreamingContext batch duration is 30 seconds and I haven't provided any 
other time intervals. Should i need to provide any other intervals like 
checkpoint interval or something like that ?

> Spark Streaming deletes checkpointed RDD then tries to load it after restart
> 
>
> Key: SPARK-17885
> URL: https://issues.apache.org/jira/browse/SPARK-17885
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Cosmin Ciobanu
>
> The issue is that the Spark driver checkpoints an RDD, deletes it, the job 
> restarts, and the new driver tries to load the deleted checkpoint RDD.
> The application is run in YARN, which attempts to restart the application a 
> number of times (100 in our case), all of which fail due to missing the 
> deleted RDD. 
> Here is a Splunk log which shows the inconsistency in checkpoint behaviour:
> *2016-10-09 02:48:43,533* [streaming-job-executor-0] INFO  
> org.apache.spark.rdd.ReliableRDDCheckpointData - Done checkpointing RDD 73847 
> to 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*, 
> new parent is RDD 73872
> host = ip-10-1-1-13.ec2.internal
> *2016-10-09 02:53:14,696* [JobGenerator] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Deleted checkpoint 
> file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*' 
> for time 147598131 ms
> host = ip-10-1-1-13.ec2.internal
> *Job restarts here, notice driver host change from ip-10-1-1-13.ec2.internal 
> to ip-10-1-1-25.ec2.internal.*
> *2016-10-09 02:53:30,175* [Driver] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Restoring 
> checkpointed RDD for time 147598131 ms from file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*'
> host = ip-10-1-1-25.ec2.internal
> *2016-10-09 02:53:30,491* [Driver] ERROR 
> org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: 
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> host = ip-10-1-1-25.ec2.internal
> Spark streaming is configured with a microbatch interval of 30 seconds, 
> checkpoint interval of 120 seconds, and cleaner.ttl of 28800 (8 hours), but 
> as far as I can tell, this TTL only affects metadata cleanup interval. RDDs 
> seem to be deleted every 4-5 minutes after being checkpointed.
> Running on top of Spark 1.5.1.
> There are at least two possible issues here:
> - In case of a driver restart the new driver tries to load checkpointed RDDs 
> which the previous driver had just deleted;
> - Spark loads stale checkpointed data - the logs show that the deleted RDD 
> was initially checkpointed 4 minutes and 31 seconds before deletion, and 4 
> minutes and 47 seconds before the new driver tries to load it. Given the fact 
> the checkpointing interval is 120 seconds, it makes no sense to load data 
> older than that.
> P.S. Looking at the source code with the event loop that handles checkpoint 
> updates and cleanup, nothing seems to have changed in more recent versions of 
> Spark, so the bug is likely present in 2.0.1 as well.
> P.P.S. The issue is difficult to reproduce - it only occurs once in every 10 
> or so restarts, and only in clusters with high-load.

[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/3/17 10:06 AM:
-

A similar problem is observed while joining 2 Hive tables on their partition 
columns in Spark.

*Example*
Table A has 1000 partitions (date partition and hour sub-partition) and 
Table B has 2 partitions. When joining the 2 tables on the partition columns, 
Spark does a full table scan of table A, i.e. it reads all 1000 partitions 
instead of reading only the 2 partitions of Table A that match Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
tables.)

*Workaround*
Select the 2 partition values from table B first, then look them up in Table A 
(a DataFrame sketch of the same idea follows below).
step1: 
{noformat} select distinct trans_date, trans_hour from tableB {noformat}
step2: 
{noformat} select * from tableA where trans_date = <value from step 1> and 
trans_hour = <value from step 1> {noformat}
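
A sketch of the same two-step workaround against the DataFrame API, assuming the 
partition values can be interpolated as strings; whether this fully avoids the 
scan depends on the Spark version.

{code}
// Step 1: collect the (few) partition values present in tableB.
val partitionsB = sqlContext.sql(
  "SELECT DISTINCT trans_date, trans_hour FROM tableB").collect()

// Step 2: query tableA only for those partition values, then join.
partitionsB.foreach { row =>
  val (d, h) = (row.get(0), row.get(1))
  val slice = sqlContext.sql(
    s"SELECT * FROM tableA WHERE trans_date = '$d' AND trans_hour = '$h'")
  slice.join(sqlContext.table("tableB"), Seq("trans_date", "trans_hour")).show()
}
{code}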




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/3/17 10:06 AM:
-

Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

{noformat}
sqlContext.sql("select * from tableA a, tableB b where 
a.trans_date=b.trans_date and a.trans_hour=b.trans_hour")
{noformat}

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/3/17 9:52 AM:


Similar problem is observed while joining 2 hive tables based on partition 
columns.

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns.

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround: *
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj edited comment on SPARK-14172 at 10/3/17 9:52 AM:


Similar problem is observed while joining 2 hive tables based on partition 
columns in Spark

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =




was (Author: saktheesh):
Similar problem is observed while joining 2 hive tables based on partition 
columns.

*Example*
Table A having 1000 partitions (date partition and hour sub-partition)  and 
Table B having 2 partition. when joining 2 tables based on partition it's going 
for full table scan in table A i.e using all 1000 partitions instead of taking 
2 partitions from Table A and join with Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
the tables)

*Workaround*
selecting 2 partitions from table B and then do lookup on Table A 
step1: select trans_date and a.trans_hour from table B 
step2: select * from tableA where trans_date= and 
a.trans_hour =



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17885) Spark Streaming deletes checkpointed RDD then tries to load it after restart

2017-10-03 Thread Vishal John (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189483#comment-16189483
 ] 

Vishal John commented on SPARK-17885:
-

I can see that the checkpointed folder was explicitly deleted: 
INFO dstream.DStreamCheckpointData: Deleted checkpoint file 
'hdfs://nameservice1/user/my-user/checkpoints/my-application/8c683e77-33b9-42ee-80f7-167abb39c241/rdd-401'

I was looking at the source code of the `cleanup` method in 
`DStreamCheckpointData`. I am curious to know which setting is causing this 
behaviour.

My StreamingContext batch duration is 30 seconds and I haven't provided any 
other time intervals. Do I need to provide any other intervals, such as a 
checkpoint interval?

> Spark Streaming deletes checkpointed RDD then tries to load it after restart
> 
>
> Key: SPARK-17885
> URL: https://issues.apache.org/jira/browse/SPARK-17885
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Cosmin Ciobanu
>
> The issue is that the Spark driver checkpoints an RDD, deletes it, the job 
> restarts, and the new driver tries to load the deleted checkpoint RDD.
> The application is run in YARN, which attempts to restart the application a 
> number of times (100 in our case), all of which fail due to missing the 
> deleted RDD. 
> Here is a Splunk log which shows the inconsistency in checkpoint behaviour:
> *2016-10-09 02:48:43,533* [streaming-job-executor-0] INFO  
> org.apache.spark.rdd.ReliableRDDCheckpointData - Done checkpointing RDD 73847 
> to 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*, 
> new parent is RDD 73872
> host = ip-10-1-1-13.ec2.internal
> *2016-10-09 02:53:14,696* [JobGenerator] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Deleted checkpoint 
> file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*' 
> for time 147598131 ms
> host = ip-10-1-1-13.ec2.internal
> *Job restarts here, notice driver host change from ip-10-1-1-13.ec2.internal 
> to ip-10-1-1-25.ec2.internal.*
> *2016-10-09 02:53:30,175* [Driver] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Restoring 
> checkpointed RDD for time 147598131 ms from file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*'
> host = ip-10-1-1-25.ec2.internal
> *2016-10-09 02:53:30,491* [Driver] ERROR 
> org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: 
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> host = ip-10-1-1-25.ec2.internal
> Spark streaming is configured with a microbatch interval of 30 seconds, 
> checkpoint interval of 120 seconds, and cleaner.ttl of 28800 (8 hours), but 
> as far as I can tell, this TTL only affects metadata cleanup interval. RDDs 
> seem to be deleted every 4-5 minutes after being checkpointed.
> Running on top of Spark 1.5.1.
> There are at least two possible issues here:
> - In case of a driver restart the new driver tries to load checkpointed RDDs 
> which the previous driver had just deleted;
> - Spark loads stale checkpointed data - the logs show that the deleted RDD 
> was initially checkpointed 4 minutes and 31 seconds before deletion, and 4 
> minutes and 47 seconds before the new driver tries to load it. Given the fact 
> the checkpointing interval is 120 seconds, it makes no sense to load data 
> older than that.
> P.S. Looking at the source code with the event loop that handles checkpoint 
> updates and cleanup, nothing seems to have changed in more recent versions of 
> Spark, so the bug is likely present in 2.0.1 as well.
> P.P.S. The issue is difficult to reproduce - it only occurs once in every 10 
> or so restarts, and only in clusters with high-load.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14172) Hive table partition predicate not passed down correctly

2017-10-03 Thread Saktheesh Balaraj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189481#comment-16189481
 ] 

Saktheesh Balaraj commented on SPARK-14172:
---

A similar problem is observed while joining 2 Hive tables on their partition 
columns.

*Example*
Table A has 1000 partitions (date partition and hour sub-partition) and 
Table B has 2 partitions. When joining the 2 tables on the partition columns, 
Spark does a full table scan of table A, i.e. it reads all 1000 partitions 
instead of reading only the 2 partitions of Table A that match Table B.

*select * from tableA a, tableB b where a.trans_date=b.trans_date and 
a.trans_hour=b.trans_hour;*

(Here trans_date is the partition and trans_hour is the sub-partition on both 
tables.)

*Workaround:*
Select the 2 partition values from table B first, then look them up in Table A. 
step1: select distinct trans_date, trans_hour from tableB 
step2: select * from tableA where trans_date = <value from step 1> and 
trans_hour = <value from step 1>



> Hive table partition predicate not passed down correctly
> 
>
> Key: SPARK-14172
> URL: https://issues.apache.org/jira/browse/SPARK-14172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Yingji Zhang
>Priority: Critical
>
> When the hive sql contains nondeterministic fields,  spark plan will not push 
> down the partition predicate to the HiveTableScan. For example:
> {code}
> -- consider following query which uses a random function to sample rows
> SELECT *
> FROM table_a
> WHERE partition_col = 'some_value'
> AND rand() < 0.01;
> {code}
> The spark plan will not push down the partition predicate to HiveTableScan 
> which ends up scanning all partitions data from the table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17885) Spark Streaming deletes checkpointed RDD then tries to load it after restart

2017-10-03 Thread Vishal John (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189403#comment-16189403
 ] 

Vishal John commented on SPARK-17885:
-


Hello all,

Our application also suffers from the same problem. It uses Spark 
state (mapWithState), and checkpointed RDDs are created in the specified 
checkpoint folder. But when the application is killed, the directory 
containing the checkpointed RDDs is cleared. 
When I launch the application again, it fails because it cannot find the 
checkpoint directory. 

This is the error: 'java.lang.IllegalArgumentException: requirement failed: 
Checkpoint directory does not exist: 
hdfs://nameservice1/user/my-user/checkpoints/my-application/77b1dd15-f904-4e80-a5ed-5018224b4df0/rdd-6833'

The application uses Spark 2.0.2 and it is deployed on Cloudera YARN 
(2.5.0-cdh5.2.0).

Because of this error we are unable to use the checkpointed RDDs and Spark 
state. Can this issue be taken up as a priority?
Please let me know if you require any additional information.
[~tdas][~srowen]

thanks a lot,
Vishal
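
As a general note, the restart path for a checkpointed streaming job is usually 
wired through StreamingContext.getOrCreate, so a fresh context is only built when 
no usable checkpoint exists; a minimal sketch follows (directory and pipeline are 
illustrative, and this does not prevent the deletion described in this ticket).

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://nameservice1/user/my-user/checkpoints/my-application"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-application")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // Illustrative pipeline; the real job would set up mapWithState here.
  ssc.socketTextStream("localhost", 9999).print()
  ssc
}

// Recovers from the checkpoint if present, otherwise builds a new context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
{code}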

> Spark Streaming deletes checkpointed RDD then tries to load it after restart
> 
>
> Key: SPARK-17885
> URL: https://issues.apache.org/jira/browse/SPARK-17885
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.5.1
>Reporter: Cosmin Ciobanu
>
> The issue is that the Spark driver checkpoints an RDD, deletes it, the job 
> restarts, and the new driver tries to load the deleted checkpoint RDD.
> The application is run in YARN, which attempts to restart the application a 
> number of times (100 in our case), all of which fail due to missing the 
> deleted RDD. 
> Here is a Splunk log which shows the inconsistency in checkpoint behaviour:
> *2016-10-09 02:48:43,533* [streaming-job-executor-0] INFO  
> org.apache.spark.rdd.ReliableRDDCheckpointData - Done checkpointing RDD 73847 
> to 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*, 
> new parent is RDD 73872
> host = ip-10-1-1-13.ec2.internal
> *2016-10-09 02:53:14,696* [JobGenerator] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Deleted checkpoint 
> file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*' 
> for time 147598131 ms
> host = ip-10-1-1-13.ec2.internal
> *Job restarts here, notice driver host change from ip-10-1-1-13.ec2.internal 
> to ip-10-1-1-25.ec2.internal.*
> *2016-10-09 02:53:30,175* [Driver] INFO  
> org.apache.spark.streaming.dstream.DStreamCheckpointData - Restoring 
> checkpointed RDD for time 147598131 ms from file 
> 'hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*'
> host = ip-10-1-1-25.ec2.internal
> *2016-10-09 02:53:30,491* [Driver] ERROR 
> org.apache.spark.deploy.yarn.ApplicationMaster - User class threw exception: 
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> java.lang.IllegalArgumentException: requirement failed: Checkpoint directory 
> does not exist: 
> hdfs://proc-job/checkpoint/cadf8dcf-ebc2-4366-a2e1-0939976c6ce1/*rdd-73847*
> host = ip-10-1-1-25.ec2.internal
> Spark streaming is configured with a microbatch interval of 30 seconds, 
> checkpoint interval of 120 seconds, and cleaner.ttl of 28800 (8 hours), but 
> as far as I can tell, this TTL only affects metadata cleanup interval. RDDs 
> seem to be deleted every 4-5 minutes after being checkpointed.
> Running on top of Spark 1.5.1.
> There are at least two possible issues here:
> - In case of a driver restart the new driver tries to load checkpointed RDDs 
> which the previous driver had just deleted;
> - Spark loads stale checkpointed data - the logs show that the deleted RDD 
> was initially checkpointed 4 minutes and 31 seconds before deletion, and 4 
> minutes and 47 seconds before the new driver tries to load it. Given the fact 
> the checkpointing interval is 120 seconds, it makes no sense to load data 
> older than that.
> P.S. Looking at the source code with the event loop that handles checkpoint 
> updates and cleanup, nothing seems to have changed in more recent versions of 
> Spark, so the bug is likely present in 2.0.1 as well.
> P.P.S. The issue is difficult to reproduce - it only occurs once in every 10 
> or so restarts, and only in clusters with high-load.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22176) Dataset.show(Int.MaxValue) hits integer overflows

2017-10-03 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22176.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Dataset.show(Int.MaxValue) hits integer overflows
> -
>
> Key: SPARK-22176
> URL: https://issues.apache.org/jira/browse/SPARK-22176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
> Fix For: 2.3.0
>
>
> scala> Seq((1, 2), (3, 4)).toDF("a", "b").show(Int.MaxValue)
> org.apache.spark.sql.AnalysisException: The limit expression must be equal to 
> or greater than 0, but got -2147483648;;
> GlobalLimit -2147483648
> +- LocalLimit -2147483648
>+- Project [_1#27218 AS a#27221, _2#27219 AS b#27222]
>   +- LocalRelation [_1#27218, _2#27219]
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:41)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:89)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$$checkLimitClause(CheckAnalysis.scala:70)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:234)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:80)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
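
For context, the -2147483648 in the error is consistent with a plain Int 
wraparound when one is added to the requested row count (the numRows + 1 step is 
an assumption about show's internals, not verified here):

{code}
val numRows   = Int.MaxValue
val requested = numRows + 1        // Int overflow: wraps around to Int.MinValue
assert(requested == -2147483648)   // matches the limit value in the AnalysisException
{code}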



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22167) Spark Packaging w/R distro issues

2017-10-03 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189309#comment-16189309
 ] 

holdenk commented on SPARK-22167:
-

I agree we could improve this. I think, though, that swapping install-dev for a 
subset of check-cran probably belongs in 2.3 rather than in a minor patch release 
(and if that's going to be the case, check-cran should probably be renamed).

> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22083) When dropping multiple blocks to disk, Spark should release all locks on a failure

2017-10-03 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-22083:

Fix Version/s: 2.1.2

> When dropping multiple blocks to disk, Spark should release all locks on a 
> failure
> --
>
> Key: SPARK-22083
> URL: https://issues.apache.org/jira/browse/SPARK-22083
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 2.1.2, 2.2.1, 2.3.0, 2.1.3
>
>
> {{MemoryStore.evictBlocksToFreeSpace}} first [acquires writer locks on all 
> the blocks it intends to evict | 
> https://github.com/apache/spark/blob/55d5fa79db883e4d93a9c102a94713c9d2d1fb55/core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala#L520].
>   However, if there is an exception while dropping blocks, there is no 
> {{finally}} block to release all the locks.
> If there is only one block being dropped, this isn't a problem (probably).  
> Usually the call stack goes from {{MemoryStore.evictBlocksToFreeSpace --> 
> dropBlocks --> BlockManager.dropFromMemory --> DiskStore.put}}.  And 
> {{DiskStore.put}} does do a [{{removeBlock()}} in a {{finally}} 
> block|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/DiskStore.scala#L83],
>  which cleans up the locks.
> I ran into this from the serialization issue in SPARK-21928.  In that, a 
> netty thread ends up trying to evict some blocks from memory to disk, and 
> fails.  When there is only one block that needs to be evicted, and the error 
> occurs, there isn't any real problem; I assume that netty thread is dead, but 
> the executor threads seem fine.  However, in the cases where two blocks get 
> dropped, one task gets completely stuck.  Unfortunately I don't have a stack 
> trace from the stuck executor, but I assume it just waits forever on this 
> lock that never gets released.
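
A sketch of the general pattern the report asks for: acquire the writer locks up 
front and guarantee they are released even if one eviction fails. BlockId and the 
lock/drop helpers below are simplified stand-ins, not Spark's actual internals.

{code}
case class BlockId(name: String)

def lockForWriting(id: BlockId): Unit = { /* acquire writer lock */ }
def unlock(id: BlockId): Unit        = { /* release writer lock */ }
def dropToDisk(id: BlockId): Unit    = { /* may throw partway through */ }

def evictBlocks(blocks: Seq[BlockId]): Unit = {
  blocks.foreach(lockForWriting)   // writer locks acquired up front
  try {
    blocks.foreach(dropToDisk)     // any of these drops may fail
  } finally {
    blocks.foreach(unlock)         // released even on failure, which is what is missing today
  }
}
{code}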



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18971) Netty issue may cause the shuffle client hang

2017-10-03 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-18971:

Fix Version/s: 2.1.2

> Netty issue may cause the shuffle client hang
> -
>
> Key: SPARK-18971
> URL: https://issues.apache.org/jira/browse/SPARK-18971
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 2.1.2, 2.2.0, 2.1.3
>
>
> Check https://github.com/netty/netty/issues/6153 for details
> You should be able to see the following similar stack track in the executor 
> thread dump.
> {code}
> "shuffle-client-7-4" daemon prio=5 tid=97 RUNNABLE
> at io.netty.util.Recycler$Stack.scavengeSome(Recycler.java:504)
> at io.netty.util.Recycler$Stack.scavenge(Recycler.java:454)
> at io.netty.util.Recycler$Stack.pop(Recycler.java:435)
> at io.netty.util.Recycler.get(Recycler.java:144)
> at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.newInstance(PooledUnsafeDirectByteBuf.java:39)
> at 
> io.netty.buffer.PoolArena$DirectArena.newByteBuf(PoolArena.java:727)
> at io.netty.buffer.PoolArena.allocate(PoolArena.java:140)
> at 
> io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177)
> at 
> io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168)
> at 
> io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:129)
> at 
> io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
> at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org