[jira] [Updated] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22173:
--
Issue Type: Improvement  (was: Bug)

> Table CSS style needs to be adjusted in History Page and in Executors Page.
> ---
>
> Key: SPARK-22173
> URL: https://issues.apache.org/jira/browse/SPARK-22173
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> There is a problem with the table CSS style.
> 1. At present, the table CSS style is too crowded, and the table width cannot 
> adapt automatically.
> 2. The table CSS style differs from that of the job, stage, task, master, and 
> worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22122) Respect WITH clauses to count input rows in TPCDSQueryBenchmark

2017-09-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22122.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.3.0

> Respect WITH clauses to count input rows in TPCDSQueryBenchmark
> ---
>
> Key: SPARK-22122
> URL: https://issues.apache.org/jira/browse/SPARK-22122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The current code ignores WITH clauses when checking input relations in TPCDS 
> queries, which leads to inaccurate per-row processing times in the benchmark 
> results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22175) Add status column to history page

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22175:
--
Priority: Trivial  (was: Major)

> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
>Priority: Trivial
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png!
> After adding:
> !after.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22163.
---
Resolution: Duplicate

[~michaeln_apache] Don't reopen this issue. It isn't a bug, and reopening it 
forks the discussion.

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically 
> as well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. This causes a random run-time 
> exception on a List when the framework happens to serialize the application's 
> objects while the application is modifying a List in one of its own objects.
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the 
> Spark Streaming framework is that it does this serialization asynchronously.
> Instead, it should either
> 1. do this serialization synchronously, which is preferred because it 
> eliminates the issue completely, or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until a fix is provided.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186893#comment-16186893
 ] 

Sean Owen commented on SPARK-21999:
---

Please don't fork the discussion.

Nothing about this suggests a design problem; at best you're asking questions 
about how the app works.
Your app causes a data structure to be serialized for use in a function that 
you ask Spark Streaming to execute. It inherently executes that asynchronously. You 
modify the collection asynchronously in your app. There is no guarantee about 
the exact moment those things might happen.
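For readers hitting this, here is a minimal, hypothetical sketch of that race 
(all names are invented) and one way an application can avoid it by 
snapshotting the mutable list before it is captured by the closure:

{code:scala}
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

// Hypothetical driver-side state: a mutable Java list that the application
// keeps updating while streaming batches are being scheduled.
class AppState extends Serializable {
  val recentKeys = new JArrayList[String]()
}

// Risky pattern: the closure captures the live ArrayList, so Spark may
// serialize it while another driver thread is mutating it, which surfaces
// as ConcurrentModificationException in ArrayList.writeObject.
//   dstream.filter(record => state.recentKeys.contains(record))

// Safer pattern: capture an immutable snapshot, taken under the same lock
// that guards the mutations, so serialization never sees a moving target.
def snapshot(state: AppState): Set[String] = state.synchronized {
  state.recentKeys.asScala.toSet
}
//   val keys = snapshot(state)
//   dstream.filter(record => keys.contains(record))
{code}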

This isn't a bug, but if you want to continue the discussion, you can raise it 
on dev@ with a more specific example of the behavior you're asking about.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8. I am getting 
> ConcurrentModificationException intermittently. When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures, so it aborts the 
> current batch and fetches the next batch, which results in lost data. The 
> exception stack is listed below.
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about the 
> serialization of accumulators in heartbeats. However, my Spark Streaming app 
> does not use accumulators.
> The stack trace listed below occurred on the Spark master, in the Spark 
> Streaming driver, at the time of the data loss.
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize? What is the root cause of this issue? 
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> 

[jira] [Closed] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-22163.
-

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> The application's objects can contain Lists and can be modified dynamically 
> as well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects as the application runs. This causes a random run-time 
> exception on a List when the framework happens to serialize the application's 
> objects while the application is modifying a List in one of its own objects.
> In fact, there are multiple bugs reported about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue of the 
> Spark Streaming framework is that it does this serialization asynchronously.
> Instead, it should either
> 1. do this serialization synchronously, which is preferred because it 
> eliminates the issue completely, or
> 2. allow it to be configured per application whether to do this serialization 
> synchronously or asynchronously, depending on the nature of each application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until a fix is provided.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22175) Add status column to history page

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22175:


Assignee: (was: Apache Spark)

> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png!
> After adding:
> !after.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22175) Add status column to history page

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22175:


Assignee: Apache Spark

> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
>Assignee: Apache Spark
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png!
> After adding:
> !after.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22175) Add status column to history page

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186832#comment-16186832
 ] 

Apache Spark commented on SPARK-22175:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19399

> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png!
> After adding:
> !after.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22175) Add status column to history page

2017-09-29 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22175:
-
Description: 
Currently, the history page has no status column that shows the status of each 
application.
Before adding:
!before.png!
After adding:
!after.png!

  was:
Currently, the history page has no status column that shows the status of each 
application.
Before adding:
!before.png|thumbnail!
After adding:
!after.png|thumbnail!


> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png!
> After adding:
> !after.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22175) Add status column to history page

2017-09-29 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22175:
-
Description: 
Currently, the history page has no status column that shows the status of each 
application.
Before adding:
!before.png|thumbnail!
After adding:
!after.png|thumbnail!

  was:
Currently, the history page has no status column that shows the status of each 
application.
Before adding:
!before.jpg|thumbnail!
After adding:
!after.jpg|thumbnail!


> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.png|thumbnail!
> After adding:
> !after.png|thumbnail!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22175) Add status column to history page

2017-09-29 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22175:
-
Attachment: after.png
before.png

> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> After adding:



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22175) Add status column to history page

2017-09-29 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-22175:
-
Description: 
Currently, the history page has no status column that shows the status of each 
application.
Before adding:
!before.jpg|thumbnail!
After adding:
!after.jpg|thumbnail!

  was:
Currently, the history page has no status column that shows the status of each 
application.
Before adding:

After adding:


> Add status column to history page
> -
>
> Key: SPARK-22175
> URL: https://issues.apache.org/jira/browse/SPARK-22175
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Web UI
>Affects Versions: 2.1.0, 2.2.0
>Reporter: zhoukang
> Attachments: after.png, before.png
>
>
> Currently, the history page has no status column that shows the status of 
> each application.
> Before adding:
> !before.jpg|thumbnail!
> After adding:
> !after.jpg|thumbnail!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22175) Add status column to history page

2017-09-29 Thread zhoukang (JIRA)
zhoukang created SPARK-22175:


 Summary: Add status column to history page
 Key: SPARK-22175
 URL: https://issues.apache.org/jira/browse/SPARK-22175
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Web UI
Affects Versions: 2.2.0, 2.1.0
Reporter: zhoukang


Currently, the history page has no status column that shows the status of each 
application.
Before adding:

After adding:



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-22174.

Resolution: Duplicate

See the duplicate bug for discussion. Also, please close the PR; this feature 
will not be added to Spark.

> Support to automatically create the directory where the event logs go 
> (`spark.eventLog.dir`) 
> -
>
> Key: SPARK-22174
> URL: https://issues.apache.org/jira/browse/SPARK-22174
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Minor
>
> {code:java}
> 2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error 
> initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> {code}
> Currently, if our applications use event logging, the directory where the 
> event logs go (`spark.eventLog.dir`) must be created manually.
> I suggest creating the event log directory automatically in the source code; 
> this would make Spark more convenient to use.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21904) Rename tempTables to tempViews in SessionCatalog

2017-09-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21904.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Rename tempTables to tempViews in SessionCatalog
> 
>
> Key: SPARK-21904
> URL: https://issues.apache.org/jira/browse/SPARK-21904
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22174:


Assignee: (was: Apache Spark)

> Support to automatically create the directory where the event logs go 
> (`spark.eventLog.dir`) 
> -
>
> Key: SPARK-22174
> URL: https://issues.apache.org/jira/browse/SPARK-22174
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Minor
>
> {code:java}
> 2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error 
> initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> {code}
> Currently, if our applications use event logging, the directory where the 
> event logs go (`spark.eventLog.dir`) must be created manually.
> I suggest creating the event log directory automatically in the source code; 
> this would make Spark more convenient to use.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186808#comment-16186808
 ] 

Apache Spark commented on SPARK-22174:
--

User 'zuotingbing' has created a pull request for this issue:
https://github.com/apache/spark/pull/19398

> Support to automatically create the directory where the event logs go 
> (`spark.eventLog.dir`) 
> -
>
> Key: SPARK-22174
> URL: https://issues.apache.org/jira/browse/SPARK-22174
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Minor
>
> {code:java}
> 2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error 
> initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> {code}
> Currently, if our applications use event logging, the directory where the 
> event logs go (`spark.eventLog.dir`) must be created manually.
> I suggest creating the event log directory automatically in the source code; 
> this would make Spark more convenient to use.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22174:


Assignee: Apache Spark

> Support to automatically create the directory where the event logs go 
> (`spark.eventLog.dir`) 
> -
>
> Key: SPARK-22174
> URL: https://issues.apache.org/jira/browse/SPARK-22174
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> 2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error 
> initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> {code}
> Currently, if our applications use event logging, the directory where the 
> event logs go (`spark.eventLog.dir`) must be created manually.
> I suggest creating the event log directory automatically in the source code; 
> this would make Spark more convenient to use.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread zuotingbing (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zuotingbing updated SPARK-22174:

Issue Type: Bug  (was: Improvement)

> Support to automatically create the directory where the event logs go 
> (`spark.eventLog.dir`) 
> -
>
> Key: SPARK-22174
> URL: https://issues.apache.org/jira/browse/SPARK-22174
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: zuotingbing
>Priority: Minor
>
> {code:java}
> 2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error 
> initializing SparkContext.
> java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
>   at org.apache.spark.SparkContext.(SparkContext.scala:516)
>   at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
>   at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
>   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
>   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
> {code}
> Currently, if our applications use event logging, the directory where the 
> event logs go (`spark.eventLog.dir`) must be created manually.
> I suggest creating the event log directory automatically in the source code; 
> this would make Spark more convenient to use.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22174) Support to automatically create the directory where the event logs go (`spark.eventLog.dir`)

2017-09-29 Thread zuotingbing (JIRA)
zuotingbing created SPARK-22174:
---

 Summary: Support to automatically create the directory where the 
event logs go (`spark.eventLog.dir`) 
 Key: SPARK-22174
 URL: https://issues.apache.org/jira/browse/SPARK-22174
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: zuotingbing
Priority: Minor


{code:java}
2017-09-30 09:47:44,721 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at 
org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:93)
at org.apache.spark.SparkContext.(SparkContext.scala:516)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:846)
at 
org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:838)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:838)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
{code}

Currently, if our applications use event logging, the directory where the event 
logs go (`spark.eventLog.dir`) must be created manually.
I suggest creating the event log directory automatically in the source code; 
this would make Spark more convenient to use.
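In the meantime, a workaround is to create the directory from the application 
before the SparkContext starts. A minimal sketch, assuming the default local 
directory (the path and config values below are examples only):

{code:scala}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Ensure the event log directory exists before building the SparkSession.
// "file:/tmp/spark-events" is just Spark's default local example path.
val eventLogDir = "file:/tmp/spark-events"
val fs = FileSystem.get(new URI(eventLogDir), new Configuration())
fs.mkdirs(new Path(eventLogDir))  // behaves like `mkdir -p`; safe if it already exists

// Then enable event logging as usual, e.g.
//   spark-submit --conf spark.eventLog.enabled=true \
//                --conf spark.eventLog.dir=file:/tmp/spark-events ...
{code}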



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22173:


Assignee: (was: Apache Spark)

> Table CSS style needs to be adjusted in History Page and in Executors Page.
> ---
>
> Key: SPARK-22173
> URL: https://issues.apache.org/jira/browse/SPARK-22173
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> There is a problem with the table CSS style.
> 1. At present, the table CSS style is too crowded, and the table width cannot 
> adapt automatically.
> 2. The table CSS style differs from that of the job, stage, task, master, and 
> worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186793#comment-16186793
 ] 

Apache Spark commented on SPARK-22173:
--

User 'guoxiaolongzte' has created a pull request for this issue:
https://github.com/apache/spark/pull/19397

> Table CSS style needs to be adjusted in History Page and in Executors Page.
> ---
>
> Key: SPARK-22173
> URL: https://issues.apache.org/jira/browse/SPARK-22173
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> There is a problem with the table CSS style.
> 1. At present, the table CSS style is too crowded, and the table width cannot 
> adapt automatically.
> 2. The table CSS style differs from that of the job, stage, task, master, and 
> worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22173:


Assignee: Apache Spark

> Table CSS style needs to be adjusted in History Page and in Executors Page.
> ---
>
> Key: SPARK-22173
> URL: https://issues.apache.org/jira/browse/SPARK-22173
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Assignee: Apache Spark
>Priority: Trivial
>
> There is a problem with the table CSS style.
> 1. At present, the table CSS style is too crowded, and the table width cannot 
> adapt automatically.
> 2. The table CSS style differs from that of the job, stage, task, master, and 
> worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186789#comment-16186789
 ] 

Alex Bozarth commented on SPARK-22173:
--

Just as a note, https://github.com/apache/spark/pull/19270, which moves the 
tasks page to data tables as well, is currently open, so it might be better to 
hold off on this until after that gets merged.

> Table CSS style needs to be adjusted in History Page and in Executors Page.
> ---
>
> Key: SPARK-22173
> URL: https://issues.apache.org/jira/browse/SPARK-22173
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Trivial
>
> There is a problem with the table CSS style.
> 1. At present, the table CSS style is too crowded, and the table width cannot 
> adapt automatically.
> 2. The table CSS style differs from that of the job, stage, task, master, and 
> worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22173) Table CSS style needs to be adjusted in History Page and in Executors Page.

2017-09-29 Thread guoxiaolongzte (JIRA)
guoxiaolongzte created SPARK-22173:
--

 Summary: Table CSS style needs to be adjusted in History Page and 
in Executors Page.
 Key: SPARK-22173
 URL: https://issues.apache.org/jira/browse/SPARK-22173
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.3.0
Reporter: guoxiaolongzte
Priority: Trivial


There is a problem with the table CSS style.

1. At present, the table CSS style is too crowded, and the table width cannot 
adapt automatically.

2. The table CSS style differs from that of the job, stage, task, master, and 
worker pages. The Spark web UI should be consistent.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22172) Worker hangs when the external shuffle service port is already in use

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22172:


Assignee: Apache Spark

> Worker hangs when the external shuffle service port is already in use
> -
>
> Key: SPARK-22172
> URL: https://issues.apache.org/jira/browse/SPARK-22172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>Assignee: Apache Spark
>
> When the external shuffle service port is already in use, the Worker throws 
> the BindException below and hangs forever. I think the exception should be 
> handled gracefully.
> {code:xml}
> 17/09/29 11:16:30 INFO ExternalShuffleService: Starting shuffle service on 
> port 7337 (auth enabled = false)
> 17/09/29 11:16:30 ERROR Inbox: Ignoring error
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:433)
> at sun.nio.ch.Net.bind(Net.java:425)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:500)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
> at 
> io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
> at 
> io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
> at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:209)
> at 
> io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:355)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22172) Worker hangs when the external shuffle service port is already in use

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22172:


Assignee: (was: Apache Spark)

> Worker hangs when the external shuffle service port is already in use
> -
>
> Key: SPARK-22172
> URL: https://issues.apache.org/jira/browse/SPARK-22172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> When the external shuffle service port is already in use, the Worker throws 
> the BindException below and hangs forever. I think the exception should be 
> handled gracefully.
> {code:xml}
> 17/09/29 11:16:30 INFO ExternalShuffleService: Starting shuffle service on 
> port 7337 (auth enabled = false)
> 17/09/29 11:16:30 ERROR Inbox: Ignoring error
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:433)
> at sun.nio.ch.Net.bind(Net.java:425)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:500)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
> at 
> io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
> at 
> io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
> at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:209)
> at 
> io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:355)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22172) Worker hangs when the external shuffle service port is already in use

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186779#comment-16186779
 ] 

Apache Spark commented on SPARK-22172:
--

User 'devaraj-kavali' has created a pull request for this issue:
https://github.com/apache/spark/pull/19396

> Worker hangs when the external shuffle service port is already in use
> -
>
> Key: SPARK-22172
> URL: https://issues.apache.org/jira/browse/SPARK-22172
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Devaraj K
>
> When the external shuffle service port is already in use, the Worker throws 
> the BindException below and hangs forever. I think the exception should be 
> handled gracefully.
> {code:xml}
> 17/09/29 11:16:30 INFO ExternalShuffleService: Starting shuffle service on 
> port 7337 (auth enabled = false)
> 17/09/29 11:16:30 ERROR Inbox: Ignoring error
> java.net.BindException: Address already in use
> at sun.nio.ch.Net.bind0(Native Method)
> at sun.nio.ch.Net.bind(Net.java:433)
> at sun.nio.ch.Net.bind(Net.java:425)
> at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
> at 
> io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:500)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
> at 
> io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
> at 
> io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
> at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:209)
> at 
> io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:355)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22172) Worker hangs when the external shuffle service port is already in use

2017-09-29 Thread Devaraj K (JIRA)
Devaraj K created SPARK-22172:
-

 Summary: Worker hangs when the external shuffle service port is 
already in use
 Key: SPARK-22172
 URL: https://issues.apache.org/jira/browse/SPARK-22172
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Devaraj K


When the external shuffle service port is already in use, the Worker throws the 
BindException below and hangs forever. I think the exception should be handled 
gracefully.

{code:xml}
17/09/29 11:16:30 INFO ExternalShuffleService: Starting shuffle service on port 
7337 (auth enabled = false)
17/09/29 11:16:30 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at 
sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at 
io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:128)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:500)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:495)
at 
io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:480)
at 
io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:209)
at 
io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:355)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)

{code}
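For illustration only (not a proposed patch; the helper below is hypothetical), 
one way a startup failure like this could be surfaced instead of leaving the 
Worker hung:

{code:scala}
import java.net.BindException

// Hypothetical wrapper around starting the shuffle service: fail fast with a
// clear message instead of logging the BindException and hanging forever.
def startShuffleServiceOrExit(start: () => Unit, port: Int): Unit = {
  try {
    start()
  } catch {
    case e: BindException =>
      System.err.println(
        s"External shuffle service failed to bind to port $port: ${e.getMessage}. " +
          "Is another shuffle service already running on this host? Exiting.")
      sys.exit(1)
  }
}
{code}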



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22171) Describe Table Extended Failed when Table Owner is Empty

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186739#comment-16186739
 ] 

Apache Spark commented on SPARK-22171:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19395

> Describe Table Extended Failed when Table Owner is Empty
> 
>
> Key: SPARK-22171
> URL: https://issues.apache.org/jira/browse/SPARK-22171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Users can hit `java.lang.NullPointerException` when a table was created by 
> Hive and the table owner retrieved from the Hive metastore is `null`. 
> `DESC EXTENDED` fails with the error:
> {noformat}
> SQLExecutionException: java.lang.NullPointerException at 
> scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at 
> scala.collection.immutable.StringOps.length(StringOps.scala:47) at 
> scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
>  at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at 
> scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at 
> scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at 
> org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66)
>  at 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22171) Describe Table Extended Failed when Table Owner is Empty

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22171:


Assignee: Apache Spark  (was: Xiao Li)

> Describe Table Extended Failed when Table Owner is Empty
> 
>
> Key: SPARK-22171
> URL: https://issues.apache.org/jira/browse/SPARK-22171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Users can hit `java.lang.NullPointerException` when a table was created by 
> Hive and the table owner retrieved from the Hive metastore is `null`. 
> `DESC EXTENDED` fails with the error:
> {noformat}
> SQLExecutionException: java.lang.NullPointerException at 
> scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at 
> scala.collection.immutable.StringOps.length(StringOps.scala:47) at 
> scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
>  at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at 
> scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at 
> scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at 
> org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66)
>  at 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22171) Describe Table Extended Failed when Table Owner is Empty

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22171:


Assignee: Xiao Li  (was: Apache Spark)

> Describe Table Extended Failed when Table Owner is Empty
> 
>
> Key: SPARK-22171
> URL: https://issues.apache.org/jira/browse/SPARK-22171
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Users can hit `java.lang.NullPointerException` when a table was created by 
> Hive and the table owner retrieved from the Hive metastore is `null`. 
> `DESC EXTENDED` fails with the error:
> {noformat}
> SQLExecutionException: java.lang.NullPointerException at 
> scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at 
> scala.collection.immutable.StringOps.length(StringOps.scala:47) at 
> scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
>  at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at 
> scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at 
> scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at 
> org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565)
>  at 
> org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66)
>  at 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22171) Describe Table Extended Failed when Table Owner is Empty

2017-09-29 Thread Xiao Li (JIRA)
Xiao Li created SPARK-22171:
---

 Summary: Describe Table Extended Failed when Table Owner is Empty
 Key: SPARK-22171
 URL: https://issues.apache.org/jira/browse/SPARK-22171
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Xiao Li
Assignee: Xiao Li


Users can hit `java.lang.NullPointerException` when a table was created by Hive 
and the table owner retrieved from the Hive metastore is `null`. `DESC 
EXTENDED` fails with the error:
{noformat}
SQLExecutionException: java.lang.NullPointerException at 
scala.collection.immutable.StringOps$.length$extension(StringOps.scala:47) at 
scala.collection.immutable.StringOps.length(StringOps.scala:47) at 
scala.collection.IndexedSeqOptimized$class.isEmpty(IndexedSeqOptimized.scala:27)
 at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:29) at 
scala.collection.TraversableOnce$class.nonEmpty(TraversableOnce.scala:111) at 
scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:29) at 
org.apache.spark.sql.catalyst.catalog.CatalogTable.toLinkedHashMap(interface.scala:300)
 at 
org.apache.spark.sql.execution.command.DescribeTableCommand.describeFormattedTableInfo(tables.scala:565)
 at 
org.apache.spark.sql.execution.command.DescribeTableCommand.run(tables.scala:543)
 at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:66)
 at 
{noformat}
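
For illustration, a minimal null-safe sketch of the pattern this stack trace suggests 
(the helper {{ownerEntry}} is hypothetical, not the actual Spark patch):

{code:scala}
// Hypothetical helper, not Spark code: shows how a null owner coming back from
// the Hive metastore can be handled without tripping StringOps.nonEmpty.
def ownerEntry(owner: String): Option[(String, String)] =
  Option(owner).filter(_.nonEmpty).map(o => "Owner" -> o)

ownerEntry(null)    // None -- no NullPointerException
ownerEntry("hive")  // Some(("Owner", "hive"))
{code}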




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22170) Broadcast join holds an extra copy of rows in driver memory

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22170:


Assignee: (was: Apache Spark)

> Broadcast join holds an extra copy of rows in driver memory
> ---
>
> Key: SPARK-22170
> URL: https://issues.apache.org/jira/browse/SPARK-22170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Ryan Blue
>
> I investigated a driver OOM that was building a large broadcast table with a 
> memory profiler and found that a huge amount of memory is used while building 
> a broadcast table. This is because [BroadcastExchangeExec uses 
> {{executeCollect}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L76].
>  In {{executeCollect}}, all of the partitions are fetched as compressed 
> blocks, then each block is decompressed (with a stream), and each row is 
> copied to a new byte buffer and added to an ArrayBuffer, which is copied to 
> an Array. This results in a huge amount of allocation: a buffer for each row 
> in the broadcast. Those rows are only used to get copied into a 
> {{BytesToBytesMap}} that will be broadcasted, so there is no need to keep 
> them in memory.
> Replacing the array buffer step with an iterator reduces the amount of memory 
> held while creating the map by not requiring all rows to be in memory. It 
> also avoids allocating a large Array for the rows. In practice, a 16MB 
> broadcast table used 100MB less memory with this approach, but the reduction 
> depends on the size of rows and compression (16MB was in Parquet format).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22170) Broadcast join holds an extra copy of rows in driver memory

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22170:


Assignee: Apache Spark

> Broadcast join holds an extra copy of rows in driver memory
> ---
>
> Key: SPARK-22170
> URL: https://issues.apache.org/jira/browse/SPARK-22170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>
> I investigated a driver OOM that was building a large broadcast table with a 
> memory profiler and found that a huge amount of memory is used while building 
> a broadcast table. This is because [BroadcastExchangeExec uses 
> {{executeCollect}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L76].
>  In {{executeCollect}}, all of the partitions are fetched as compressed 
> blocks, then each block is decompressed (with a stream), and each row is 
> copied to a new byte buffer and added to an ArrayBuffer, which is copied to 
> an Array. This results in a huge amount of allocation: a buffer for each row 
> in the broadcast. Those rows are only used to get copied into a 
> {{BytesToBytesMap}} that will be broadcasted, so there is no need to keep 
> them in memory.
> Replacing the array buffer step with an iterator reduces the amount of memory 
> held while creating the map by not requiring all rows to be in memory. It 
> also avoids allocating a large Array for the rows. In practice, a 16MB 
> broadcast table used 100MB less memory with this approach, but the reduction 
> depends on the size of rows and compression (16MB was in Parquet format).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22170) Broadcast join holds an extra copy of rows in driver memory

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186713#comment-16186713
 ] 

Apache Spark commented on SPARK-22170:
--

User 'rdblue' has created a pull request for this issue:
https://github.com/apache/spark/pull/19394

> Broadcast join holds an extra copy of rows in driver memory
> ---
>
> Key: SPARK-22170
> URL: https://issues.apache.org/jira/browse/SPARK-22170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Ryan Blue
>
> I investigated a driver OOM that was building a large broadcast table with a 
> memory profiler and found that a huge amount of memory is used while building 
> a broadcast table. This is because [BroadcastExchangeExec uses 
> {{executeCollect}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L76].
>  In {{executeCollect}}, all of the partitions are fetched as compressed 
> blocks, then each block is decompressed (with a stream), and each row is 
> copied to a new byte buffer and added to an ArrayBuffer, which is copied to 
> an Array. This results in a huge amount of allocation: a buffer for each row 
> in the broadcast. Those rows are only used to get copied into a 
> {{BytesToBytesMap}} that will be broadcasted, so there is no need to keep 
> them in memory.
> Replacing the array buffer step with an iterator reduces the amount of memory 
> held while creating the map by not requiring all rows to be in memory. It 
> also avoids allocating a large Array for the rows. In practice, a 16MB 
> broadcast table used 100MB less memory with this approach, but the reduction 
> depends on the size of rows and compression (16MB was in Parquet format).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22170) Broadcast join holds an extra copy of rows in driver memory

2017-09-29 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-22170:
-

 Summary: Broadcast join holds an extra copy of rows in driver 
memory
 Key: SPARK-22170
 URL: https://issues.apache.org/jira/browse/SPARK-22170
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.0.2
Reporter: Ryan Blue


I investigated a driver OOM that was building a large broadcast table with a 
memory profiler and found that a huge amount of memory is used while building a 
broadcast table. This is because [BroadcastExchangeExec uses 
{{executeCollect}}|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L76].
 In {{executeCollect}}, all of the partitions are fetched as compressed blocks, 
then each block is decompressed (with a stream), and each row is copied to a 
new byte buffer and added to an ArrayBuffer, which is copied to an Array. This 
results in a huge amount of allocation: a buffer for each row in the broadcast. 
Those rows are only used to get copied into a {{BytesToBytesMap}} that will be 
broadcasted, so there is no need to keep them in memory.

Replacing the array buffer step with an iterator reduces the amount of memory 
held while creating the map by not requiring all rows to be in memory. It also 
avoids allocating a large Array for the rows. In practice, a 16MB broadcast 
table used 100MB less memory with this approach, but the reduction depends on 
the size of rows and compression (16MB was in Parquet format).
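
A simplified sketch of the before/after memory behaviour described above; 
{{BroadcastCollectSketch}}, {{Block}}, {{Row}} and {{decode}} are stand-ins for 
illustration, not Spark internals:

{code:scala}
import scala.collection.mutable.ArrayBuffer

object BroadcastCollectSketch {
  // Simplified stand-ins for Spark's internal block and row representations.
  type Block = Array[Byte]
  type Row = Array[Byte]

  // Before: every decoded row is appended to a buffer, so the driver holds all
  // rows in memory on top of the BytesToBytesMap being built from them.
  def collectAll(blocks: Seq[Block], decode: Block => Iterator[Row]): Array[Row] = {
    val buf = ArrayBuffer.empty[Row]
    blocks.foreach(b => buf ++= decode(b))
    buf.toArray
  }

  // After: rows are produced lazily, so each one can be copied into the
  // broadcast map and become garbage immediately afterwards.
  def streamAll(blocks: Seq[Block], decode: Block => Iterator[Row]): Iterator[Row] =
    blocks.iterator.flatMap(decode)
}
{code}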



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-29 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186682#comment-16186682
 ] 

Michael N edited comment on SPARK-21999 at 9/29/17 11:35 PM:
-

You were looking at the mechanics of locking. However, let's look at the 
questions about the Spark framework design issue that precede those mechanics:

1. In the first place, why does Spark serialize the application's objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?

This is why I created a separate ticket at 
https://issues.apache.org/jira/browse/SPARK-22163 to address these design 
issues at a broader scope. We need to distinguish between coding issues and 
design issues.


was (Author: michaeln_apache):
You were looking at the mechanics of locking. However, let's look at the 
questions at the design scope that precede those mechanics:

1. In the first place, why does Spark serialize the application's objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?

This is why I created a separate ticket at 
https://issues.apache.org/jira/browse/SPARK-22163 to address these design 
issues at a broader scope. We need to distinguish between coding issues and 
design issues.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. So Spark aborts the 
> current batch  and fetch the next batch, so it results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> 

[jira] [Reopened] (SPARK-22163) Design Issue of Spark Streaming that Causes Random Run-time Exception

2017-09-29 Thread Michael N (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael N reopened SPARK-22163:
---

This ticket is *not* a duplicate of ticket 
https://issues.apache.org/jira/browse/SPARK-21999. This ticket addresses the 
Spark framework design issue, at a broader scope, that precedes that ticket. We 
need to distinguish between coding issues and design issues.

> Design Issue of Spark Streaming that Causes Random Run-time Exception
> -
>
> Key: SPARK-22163
> URL: https://issues.apache.org/jira/browse/SPARK-22163
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Structured Streaming
>Affects Versions: 2.2.0
> Environment: Spark Streaming
> Kafka
> Linux
>Reporter: Michael N
>Priority: Critical
>
> Application objects can contain a List and can be modified dynamically as 
> well. However, the Spark Streaming framework asynchronously serializes the 
> application's objects while the application runs. Therefore, a random 
> run-time exception occurs on the List when the Spark Streaming framework 
> happens to serialize the application's objects while the application is 
> modifying a List in one of its own objects.
> In fact, there are multiple reported bugs about
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject
> that are permutations of the same root cause. So the design issue in the 
> Spark Streaming framework is that it performs this serialization 
> asynchronously. Instead, it should either
> 1. do this serialization synchronously, which is preferred because it 
> eliminates the issue completely, or
> 2. allow each application to configure whether this serialization is done 
> synchronously or asynchronously, depending on the nature of the application.
> Also, the Spark documentation should describe the conditions that trigger 
> Spark to do this type of serialization asynchronously, so applications can 
> work around them until a fix is provided.
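
Until the framework behaviour changes, one application-side workaround sketch 
(an assumption, not an endorsed fix) is to hand Spark immutable snapshots instead 
of the live, mutable list, so background serialization never races with in-place 
updates:

{code:scala}
import java.util.{ArrayList => JArrayList, Collections, List => JList}

// Hypothetical holder for mutable application state. The holder itself is never
// captured by a Spark closure; only immutable snapshots are.
class SharedState {
  private val live = new JArrayList[String]()

  def add(item: String): Unit = live.synchronized { live.add(item) }

  // Defensive, unmodifiable copy: safe to capture and serialize, because
  // nothing mutates it after it has been handed out.
  def snapshotForSpark(): JList[String] =
    live.synchronized { Collections.unmodifiableList(new JArrayList[String](live)) }
}
{code}

Closures submitted to Spark would then capture the result of {{snapshotForSpark()}} 
rather than the holder or the live list.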



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21999) ConcurrentModificationException - Spark Streaming

2017-09-29 Thread Michael N (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186682#comment-16186682
 ] 

Michael N commented on SPARK-21999:
---

You were looking at the mechanics of locking. However, let's look at the 
questions at the design scope that precede those mechanics:

1. In the first place, why does Spark serialize the application's objects 
asynchronously while the streaming application is running continuously from 
batch to batch?

2. If Spark needs to do this type of serialization at all, why does it not do 
it at the end of the batch?

This is why I created a separate ticket at 
https://issues.apache.org/jira/browse/SPARK-22163 to address these design 
issues at a broader scope. We need to distinguish between coding issues and 
design issues.

> ConcurrentModificationException - Spark Streaming
> -
>
> Key: SPARK-21999
> URL: https://issues.apache.org/jira/browse/SPARK-21999
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Michael N
>Priority: Critical
>
> Hi,
> I am using Spark Streaming v2.1.0 with Kafka 0.8.  I am getting 
> ConcurrentModificationException intermittently.  When it occurs, Spark does 
> not honor the specified value of spark.task.maxFailures. Spark aborts the 
> current batch and fetches the next batch, which results in lost data. Its 
> exception stack is listed below. 
> This instance of ConcurrentModificationException is similar to the issue at 
> https://issues.apache.org/jira/browse/SPARK-17463, which was about 
> Serialization of accumulators in heartbeats.  However, my Spark stream app 
> does not use accumulators. 
> The stack trace listed below occurred on the Spark master in Spark streaming 
> driver at the time of data loss.   
> From the line of code in the first stack trace, can you tell which object 
> Spark was trying to serialize ?  What is the root cause for this issue  ?  
> Because this issue results in lost data as described above, could you have 
> this issue fixed ASAP ?
> Thanks.
> Michael N.,
> 
> Stack trace of Spark Streaming driver
> ERROR JobScheduler:91: Error generating jobs for time 150522493 ms
> org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
>   at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:792)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:792)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream$$anonfun$compute$1.apply(MapPartitionedDStream.scala:37)
>   at scala.Option.map(Option.scala:146)
>   at 
> org.apache.spark.streaming.dstream.MapPartitionedDStream.compute(MapPartitionedDStream.scala:37)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:341)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:340)
>   at 
> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:415)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:335)
>   at 
> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:333)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:330)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> org.apache.spark.streaming.dstream.TransformedDStream$$anonfun$6.apply(TransformedDStream.scala:42)
>   at 
> 

[jira] [Commented] (SPARK-21644) LocalLimit.maxRows is defined incorrectly

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186464#comment-16186464
 ] 

Apache Spark commented on SPARK-21644:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/19393

> LocalLimit.maxRows is defined incorrectly
> -
>
> Key: SPARK-21644
> URL: https://issues.apache.org/jira/browse/SPARK-21644
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> {code}
> case class LocalLimit(limitExpr: Expression, child: LogicalPlan) extends 
> UnaryNode {
>   override def output: Seq[Attribute] = child.output
>   override def maxRows: Option[Long] = {
> limitExpr match {
>   case IntegerLiteral(limit) => Some(limit)
>   case _ => None
> }
>   }
> }
> {code}
> This is simply wrong, since LocalLimit is only about partition level limits.
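
To make the partition-level semantics concrete, a hedged sketch of one way they 
could be expressed ({{LocalLimitSketch}} and {{maxRowsPerPartition}} are 
illustrative assumptions, not the actual fix):

{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression, IntegerLiteral}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, UnaryNode}

// Sketch only: the limit literal bounds rows *per partition*, so it cannot be
// reported as a global maxRows; the safest global bound comes from the child.
case class LocalLimitSketch(limitExpr: Expression, child: LogicalPlan) extends UnaryNode {
  override def output: Seq[Attribute] = child.output

  // What the literal really expresses: a per-partition cap.
  def maxRowsPerPartition: Option[Long] = limitExpr match {
    case IntegerLiteral(limit) => Some(limit)
    case _ => None
  }

  // Without knowing the number of partitions, defer to the child's bound.
  override def maxRows: Option[Long] = child.maxRows
}
{code}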



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-09-29 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186062#comment-16186062
 ] 

Marcelo Vanzin commented on SPARK-18838:


Not really. You could try to backport the patch (not trivial) and build your 
own Spark.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, such as `ExecutorAllocationManager` and `HeartbeatReceiver`, depend 
> on `ListenerBus` events, and this delay might hurt job performance 
> significantly or even fail the job. For example, a significant delay in 
> receiving `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to 
> mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94).
>  This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price of 
> the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of events per 
> listener. The queue used by each executor service will be bounded to guarantee 
> that memory does not grow indefinitely. The downside of this approach is that 
> a separate event queue per listener will increase the driver memory footprint.
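
A minimal sketch of the proposed per-listener dispatch model 
({{PerListenerDispatcher}} and {{maxQueueSize}} are names assumed for 
illustration, not Spark APIs):

{code:scala}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// One single-threaded executor per listener: events are processed in order per
// listener, and a slow listener only delays (or drops) its own queue.
class PerListenerDispatcher[E](listener: E => Unit, maxQueueSize: Int = 10000) {
  private val executor = new ThreadPoolExecutor(
    1, 1, 0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue[Runnable](maxQueueSize),
    new ThreadPoolExecutor.DiscardPolicy()) // bounded queue: drop events when full

  def post(event: E): Unit = executor.execute(new Runnable {
    override def run(): Unit = listener(event)
  })

  def stop(): Unit = executor.shutdown()
}
{code}

The bus would then fan each event out by calling {{post}} on every dispatcher 
instead of looping over the listeners on a single thread.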



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22169) table name with numbers and characters should be able to be parsed

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22169:


Assignee: Wenchen Fan  (was: Apache Spark)

> table name with numbers and characters should be able to be parsed
> --
>
> Key: SPARK-22169
> URL: https://issues.apache.org/jira/browse/SPARK-22169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22169) table name with numbers and characters should be able to be parsed

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186033#comment-16186033
 ] 

Apache Spark commented on SPARK-22169:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19392

> table name with numbers and characters should be able to be parsed
> --
>
> Key: SPARK-22169
> URL: https://issues.apache.org/jira/browse/SPARK-22169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22169) table name with numbers and characters should be able to be parsed

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22169:


Assignee: Apache Spark  (was: Wenchen Fan)

> table name with numbers and characters should be able to be parsed
> --
>
> Key: SPARK-22169
> URL: https://issues.apache.org/jira/browse/SPARK-22169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22169) table name with numbers and characters should be able to be parsed

2017-09-29 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-22169:
---

 Summary: table name with numbers and characters should be able to 
be parsed
 Key: SPARK-22169
 URL: https://issues.apache.org/jira/browse/SPARK-22169
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22161) Add Impala-modified TPC-DS queries

2017-09-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22161.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Add Impala-modified TPC-DS queries
> --
>
> Key: SPARK-22161
> URL: https://issues.apache.org/jira/browse/SPARK-22161
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.2.1, 2.3.0
>
>
> Added IMPALA-modified TPCDS queries to TPC-DS query suites.
> - Ref: https://github.com/cloudera/impala-tpcds-kit/tree/master/queries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22167) Spark Packaging w/R distro issues

2017-09-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186008#comment-16186008
 ] 

holdenk commented on SPARK-22167:
-

So for some reason the R directory in the hadoop 2.7 build looks like:
holden@holden:~/repos/spark/spark-2.1.2-bin-hadoop2.7$ ls R
check-cran.sh  CRAN_RELEASE.md  create-docs.sh  DOCUMENTATION.md  
install-dev.bat  install-dev.sh  log4j.properties  pkg  README.md  run-tests.sh 
 WINDOWS.md
holden@holden:~/repos/spark/spark-2.1.2-bin-hadoop2.7$ 

I think there is a race condition which only shows up on my laptop where the 
Spark directory is modified in one of the previous build steps before copying 
into the hadoop-2.7 version.


> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22168) py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed out

2017-09-29 Thread Krishnaprasad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186004#comment-16186004
 ] 

Krishnaprasad commented on SPARK-22168:
---

Thanks for the reply Owen. I will post this question on the forums that you had 
suggested.

Regards,
Krishnaprasad

> py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed 
> out
> ---
>
> Key: SPARK-22168
> URL: https://issues.apache.org/jira/browse/SPARK-22168
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: Linux - Ubuntu 14.04, Python 3.4
>Reporter: Krishnaprasad
>  Labels: None
>
> Hi all,
> I am looking for a resolution or workaround for the problem below. It would 
> be helpful if somebody could suggest a quick solution to this problem.
> Traceback (most recent call last):
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1028, in send_command
> answer = smart_decode(self.stream.readline()[:-1])
>   File "/usr/lib/python3.4/socket.py", line 374, in readinto
> return self._sock.recv_into(b)
> socket.timeout: timed out
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 883, in send_command
> response = connection.send_command(command)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1040, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Process Process-1:
> Traceback (most recent call last):
>   File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in 
> _bootstrap
> self.run()
>   File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
> self._target(*self._args, **self._kwargs)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 327, in get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o180.fit



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16186001#comment-16186001
 ] 

Apache Spark commented on SPARK-22146:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19391

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.
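
For reference, a minimal illustration of the double-encoding visible in the error 
above, assuming the mechanism is that an already-escaped path string is fed back 
through {{new Path(String)}}, which URI-escapes the space and the existing '%' again:

{code:scala}
import org.apache.hadoop.fs.Path

// The literal directory name contains a space and the three characters "%3A".
val p = new Path("/tmp/orc_test/folder %3Aa/orc1.orc")

// Path treats the string as a URI path and escapes it (' ' -> %20, '%' -> %25),
// producing the non-existent location reported in the stack trace.
println(p.toUri)  // /tmp/orc_test/folder%20%253Aa/orc1.orc
{code}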



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22168) py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed out

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22168.
---
Resolution: Invalid

By itself this error doesn't mean anything; it just says that something else 
went wrong.
JIRA isn't the right place for this; StackOverflow or the mailing list, after 
you give more detail, might be.

> py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed 
> out
> ---
>
> Key: SPARK-22168
> URL: https://issues.apache.org/jira/browse/SPARK-22168
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
> Environment: Linux - Ubuntu 14.04, Python 3.4
>Reporter: Krishnaprasad
>  Labels: None
>
> Hi all,
> I am looking for a resolution or workaround for the problem below. It would 
> be helpful if somebody could suggest a quick solution to this problem.
> Traceback (most recent call last):
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1028, in send_command
> answer = smart_decode(self.stream.readline()[:-1])
>   File "/usr/lib/python3.4/socket.py", line 374, in readinto
> return self._sock.recv_into(b)
> socket.timeout: timed out
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 883, in send_command
> response = connection.send_command(command)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1040, in send_command
> "Error while receiving", e, proto.ERROR_ON_RECEIVE)
> py4j.protocol.Py4JNetworkError: Error while receiving
> Process Process-1:
> Traceback (most recent call last):
>   File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in 
> _bootstrap
> self.run()
>   File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
> self._target(*self._args, **self._kwargs)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>  line 1133, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>  line 327, in get_return_value
> format(target_id, ".", name))
> py4j.protocol.Py4JError: An error occurred while calling o180.fit



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22167) Spark Packaging w/R distro issues

2017-09-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185959#comment-16185959
 ] 

holdenk commented on SPARK-22167:
-

Here is the build log 
https://gist.github.com/holdenk/8d8bf00a0fc2186bdcf46e2c8748d365

> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22168) py4j.protocol.Py4JNetworkError: Error while receiving Socket.timeout: timed out

2017-09-29 Thread Krishnaprasad (JIRA)
Krishnaprasad created SPARK-22168:
-

 Summary: py4j.protocol.Py4JNetworkError: Error while receiving 
Socket.timeout: timed out
 Key: SPARK-22168
 URL: https://issues.apache.org/jira/browse/SPARK-22168
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
 Environment: Linux - Ubuntu 14.04, Python 3.4
Reporter: Krishnaprasad


Hi all,

I am looking for a resolution or workaround for the problem below. It would be 
helpful if somebody could suggest a quick solution to this problem.

Traceback (most recent call last):
  File 
"/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1028, in send_command
answer = smart_decode(self.stream.readline()[:-1])
  File "/usr/lib/python3.4/socket.py", line 374, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File 
"/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 883, in send_command
response = connection.send_command(command)
  File 
"/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
self.run()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
  File 
"/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
 line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/spark-2.2.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 
63, in deco
return f(*a, **kw)
  File 
"/usr/local/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
 line 327, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling o180.fit



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-22167) Spark Packaging w/R distro issues

2017-09-29 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-22167:

Comment: was deleted

(was: [^build.log])

> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22167) Spark Packaging w/R distro issues

2017-09-29 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185956#comment-16185956
 ] 

holdenk commented on SPARK-22167:
-

[^build.log]

> Spark Packaging w/R distro issues
> -
>
> Key: SPARK-22167
> URL: https://issues.apache.org/jira/browse/SPARK-22167
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SparkR
>Affects Versions: 2.1.2
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
>
> The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the 
> R directory was missing from the hadoop-2.7 bin distro. This is the version 
> we build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22167) Spark Packaging w/R distro issues

2017-09-29 Thread holdenk (JIRA)
holdenk created SPARK-22167:
---

 Summary: Spark Packaging w/R distro issues
 Key: SPARK-22167
 URL: https://issues.apache.org/jira/browse/SPARK-22167
 Project: Spark
  Issue Type: Bug
  Components: Build, SparkR
Affects Versions: 2.1.2
Reporter: holdenk
Assignee: holdenk
Priority: Blocker


The Spark packaging for Spark R in 2.1.2 did not work as expected, namely the R 
directory was missing from the hadoop-2.7 bin distro. This is the version we 
build the PySpark package for so it's possible this is related.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22138) Allow retry during release-build

2017-09-29 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-22138.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19359
[https://github.com/apache/spark/pull/19359]

> Allow retry during release-build
> 
>
> Key: SPARK-22138
> URL: https://issues.apache.org/jira/browse/SPARK-22138
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0
> Environment: Right now the build script is configured with no 
> retries, but since transient issues exist with networking lets allow a small 
> number of retries.
>Reporter: holdenk
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22129) Spark release scripts ignore the GPG_KEY and always sign with your default key

2017-09-29 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-22129.
-
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19359
[https://github.com/apache/spark/pull/19359]

> Spark release scripts ignore the GPG_KEY and always sign with your default key
> --
>
> Key: SPARK-22129
> URL: https://issues.apache.org/jira/browse/SPARK-22129
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1, 2.3.0
>Reporter: holdenk
>Assignee: holdenk
>Priority: Blocker
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> Currently the release scripts require GPG_KEY be specified but the param is 
> ignored and instead the default GPG key is used. Change this to sign with the 
> specified key.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185937#comment-16185937
 ] 

Liang-Chi Hsieh edited comment on SPARK-22137 at 9/29/17 3:04 PM:
--

Actually, that is because we only allow casting between {{UserDefinedType}} s 
and disallow casting between a {{UserDefinedType}} and other data types.


was (Author: viirya):
Actually, that is because we only allow casting between {{UserDefinedType}} s 
and disallow a {{UserDefinedType}} to be cast to/from other data types.

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert a VectorUDT into a Hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> // Create a Hive table based on the VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185937#comment-16185937
 ] 

Liang-Chi Hsieh edited comment on SPARK-22137 at 9/29/17 2:56 PM:
--

Actually, that is because we only allow casting between {{UserDefinedType}} s 
and disallow a {{UserDefinedType}} to be cast to/from other data types.


was (Author: viirya):
Actually, that is because we only allow casting between {{UserDefinedType}}s and 
disallow a {{UserDefinedType}} to be cast to/from other data types.

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert a VectorUDT into a Hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> // Create a Hive table based on the VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185937#comment-16185937
 ] 

Liang-Chi Hsieh commented on SPARK-22137:
-

Actually, that is because we only allow casting between {{UserDefinedType}}s and 
disallow a {{UserDefinedType}} to be cast to/from other data types.
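
A hedged illustration of that rule; it leans on the internal Catalyst helper 
{{Cast.canCast}}, so treat the exact call and behaviour as an assumption rather 
than a public contract:

{code:scala}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.catalyst.expressions.Cast
import org.apache.spark.sql.types._

// The struct type backing VectorUDT, as listed in the reproduce code's comment.
val backingStruct = StructType(Seq(
  StructField("type", ByteType, nullable = false),
  StructField("size", IntegerType, nullable = true),
  StructField("indices", ArrayType(IntegerType, containsNull = false), nullable = true),
  StructField("values", ArrayType(DoubleType, containsNull = false), nullable = true)))

Cast.canCast(VectorType, VectorType)     // true: UDT-to-UDT casts are allowed
Cast.canCast(VectorType, backingStruct)  // false: UDT to/from any other type is
                                         // rejected, hence the AnalysisException
{code}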

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> //Create Hive table based on the VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-09-29 Thread Shahbaz Hussain (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185856#comment-16185856
 ] 

Shahbaz Hussain commented on SPARK-18838:
-

Is there a workaround to get this working before Spark 2.3.0 is released?



> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Marcelo Vanzin
> Fix For: 2.3.0
>
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, such as `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job. For example, a 
> significant delay in receiving `SparkListenerTaskStart` might cause the 
> `ExecutorAllocationManager` to mistakenly remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
> This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners pay the price 
> of the delay incurred by the slow listener. In addition, a slow listener can 
> cause events to be dropped from the event queue, which might be fatal to the 
> job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of the events 
> per listener. The queue used for each executor service will be bounded, to 
> guarantee we do not grow memory indefinitely. The downside of this approach 
> is that a separate event queue per listener will increase the driver memory 
> footprint. 
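
A minimal sketch of the per-listener dispatch design described above, with hypothetical 
{{Event}} and {{Listener}} types (this is not the actual Spark implementation, only an 
illustration of one bounded single-threaded executor per listener):

{code:java}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Hypothetical event/listener types used only for this sketch.
trait Event
trait Listener { def onEvent(e: Event): Unit }

class PerListenerBus(listeners: Seq[Listener], queueSize: Int = 10000) {
  // One single-threaded executor per listener: in-order processing per listener,
  // and a slow listener only delays (or drops) its own events.
  private val executors = listeners.map { l =>
    l -> new ThreadPoolExecutor(1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueSize),
      new ThreadPoolExecutor.DiscardOldestPolicy()) // bounded queue: drop instead of growing memory
  }.toMap

  def post(e: Event): Unit = executors.foreach { case (l, ex) =>
    ex.execute(new Runnable { def run(): Unit = l.onEvent(e) })
  }

  def stop(): Unit = executors.values.foreach(_.shutdown())
}
{code}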



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2017-09-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185744#comment-16185744
 ] 

Steve Loughran commented on SPARK-2356:
---

[~Vasilina], that probably means you're running with Hadoop <=2.7; the more 
helpful message only went in with HADOOP-10775. Sorry.

I'm about to close this as a duplicate of HADOOP-10775, as it really is a 
config problem (plus the need for the Hadoop libs to have a copy of 
winutils.exe around for file operations). All that can be done, short of 
removing that dependency, is fixing the error message, which we've done our 
best at.
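
A minimal sketch of the usual local workaround, assuming winutils.exe has been copied to 
an illustrative path such as C:\hadoop\bin; the property must be set before any 
Spark/Hadoop class initializes:

{code:java}
import org.apache.spark.sql.SparkSession

// Point Hadoop's Shell utilities at a directory whose bin\ contains winutils.exe
// (the path is an example; setting the HADOOP_HOME environment variable works as well).
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("windows-local-test")
  .getOrCreate()
{code}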

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0, 1.1.1, 1.2.1, 1.2.2, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized every time a Spark 
> context is created, regardless of whether Hadoop is required.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2017-09-29 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran resolved SPARK-2356.
---
Resolution: Duplicate

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0, 1.1.1, 1.2.1, 1.2.2, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized every time a Spark 
> context is created, regardless of whether Hadoop is required.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18935:


Assignee: (was: Apache Spark)

> Use Mesos "Dynamic Reservation" resource for Spark
> --
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: jackyoh
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 1
> And the console will keep logging the same warning message as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources
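
A minimal sketch of the same submission expressed programmatically, assuming the 
dynamic reservation above succeeded; the master URL and role simply mirror the 
spark-submit command and are illustrative only:

{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://192.168.1.118:5050")
  .setAppName("SparkPi")
  .set("spark.mesos.role", "spark") // must match the role used in the dynamic reservation
{code}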



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2017-09-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185735#comment-16185735
 ] 

Apache Spark commented on SPARK-18935:
--

User 'skonto' has created a pull request for this issue:
https://github.com/apache/spark/pull/19390

> Use Mesos "Dynamic Reservation" resource for Spark
> --
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: jackyoh
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 1
> And the console will keep logging the same warning message as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18935) Use Mesos "Dynamic Reservation" resource for Spark

2017-09-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18935:


Assignee: Apache Spark

> Use Mesos "Dynamic Reservation" resource for Spark
> --
>
> Key: SPARK-18935
> URL: https://issues.apache.org/jira/browse/SPARK-18935
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.0.2
>Reporter: jackyoh
>Assignee: Apache Spark
>
> I'm running Spark on Apache Mesos.
> Please follow these steps to reproduce the issue:
> 1. First, run Mesos resource reserve:
> curl -i -d slaveId=c24d1cfb-79f3-4b07-9f8b-c7b19543a333-S0 -d 
> resources='[{"name":"cpus","type":"SCALAR","scalar":{"value":20},"role":"spark","reservation":{"principal":""}},{"name":"mem","type":"SCALAR","scalar":{"value":4096},"role":"spark","reservation":{"principal":""}}]'
>  -X POST http://192.168.1.118:5050/master/reserve
> 2. Then run spark-submit command:
> ./spark-submit --class org.apache.spark.examples.SparkPi --master 
> mesos://192.168.1.118:5050 --conf spark.mesos.role=spark  
> ../examples/jars/spark-examples_2.11-2.0.2.jar 1
> And the console will keep logging the same warning message as shown below: 
> 16/12/19 22:33:28 WARN TaskSchedulerImpl: Initial job has not accepted any 
> resources; check your cluster UI to ensure that workers are registered and 
> have sufficient resources



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-22166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185722#comment-16185722
 ] 

吴志龙 commented on SPARK-22166:
-

The JVM throws this exception while spilling data to disk; increasing executor memory 
works around it for me. Perhaps writing to disk sooner, or tuning the memory proportion 
down a little, would reduce the exception.
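
A minimal sketch of the mitigations mentioned above, with illustrative values only (for 
the spark-sql CLI the same settings can be passed via --executor-memory / --conf); 
spark.executor.memory and spark.memory.fraction are the relevant configuration keys:

{code:java}
import org.apache.spark.sql.SparkSession

// Example values, not recommendations: give executors more headroom than the 3g
// used above, and shrink the unified memory fraction so sorts spill to disk earlier.
val spark = SparkSession.builder()
  .appName("spill-tuning-sketch")
  .config("spark.executor.memory", "6g")
  .config("spark.memory.fraction", "0.5")
  .getOrCreate()
{code}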

> java.lang.OutOfMemoryError: error while calling spill() 
> 
>
> Key: SPARK-22166
> URL: https://issues.apache.org/jira/browse/SPARK-22166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: spark 2.2
> hadoop 2.6.0
> jdk 1.8
>Reporter: 吴志龙
>
> ${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
> --driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
> --executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
> spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
> -XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""
> java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>  +details
> java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:161)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:378)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:402)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
> FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message= +details
> FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message=
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 3
>   at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
>   at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
>   at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at 

[jira] [Commented] (SPARK-2356) Exception: Could not locate executable null\bin\winutils.exe in the Hadoop

2017-09-29 Thread Vasilina Terahava (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185716#comment-16185716
 ] 

Vasilina Terahava commented on SPARK-2356:
--

As for the Parquet libraries, in this case they print the error 
"Caused by: java.io.FileNotFoundException: java.io.FileNotFoundException: 
HADOOP_HOME and hadoop.home.dir are unset. -see 
https://wiki.apache.org/hadoop/WindowsProblems".
In our case we see "Could not locate executable null\bin\winutils.exe in the 
Hadoop" with a null, which makes it unclear where the root cause comes from. 
Could we update the message at least?

> Exception: Could not locate executable null\bin\winutils.exe in the Hadoop 
> ---
>
> Key: SPARK-2356
> URL: https://issues.apache.org/jira/browse/SPARK-2356
> Project: Spark
>  Issue Type: Bug
>  Components: Windows
>Affects Versions: 1.0.0, 1.1.1, 1.2.1, 1.2.2, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 
> 1.5.1, 1.5.2
>Reporter: Kostiantyn Kudriavtsev
>Priority: Critical
>
> I'm trying to run some transformations on Spark; they work fine on a cluster 
> (YARN, Linux machines). However, when I try to run them on a local machine 
> (Windows 7) under a unit test, I get errors (I don't use Hadoop; I read files 
> from the local filesystem):
> {code}
> 14/07/02 19:59:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 14/07/02 19:59:31 ERROR Shell: Failed to locate the winutils binary in the 
> hadoop binary path
> java.io.IOException: Could not locate executable null\bin\winutils.exe in the 
> Hadoop binaries.
>   at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:318)
>   at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:333)
>   at org.apache.hadoop.util.Shell.(Shell.java:326)
>   at org.apache.hadoop.util.StringUtils.(StringUtils.java:76)
>   at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:93)
>   at org.apache.hadoop.security.Groups.(Groups.java:77)
>   at 
> org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:240)
>   at 
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:255)
>   at 
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:283)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.(SparkHadoopUtil.scala:36)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala:109)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$.(SparkHadoopUtil.scala)
>   at org.apache.spark.SparkContext.(SparkContext.scala:228)
>   at org.apache.spark.SparkContext.(SparkContext.scala:97)
> {code}
> This happens because the Hadoop config is initialized every time a Spark 
> context is created, regardless of whether Hadoop is required.
> I propose adding a special flag to indicate whether the Hadoop config is 
> required (or starting this configuration manually).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-22166:

Description: 
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""

java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
/home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
 +details
java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
/home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:161)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:378)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:402)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)


FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message= +details
FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message=
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
location for shuffle 3
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
at 
org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at 
org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
at 
org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
at 
org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)

  was:
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""

ava.lang.OutOfMemoryError: error while calling spill() on 

[jira] [Resolved] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22166.
---
Resolution: Invalid

This just means you ran out of memory.

> java.lang.OutOfMemoryError: error while calling spill() 
> 
>
> Key: SPARK-22166
> URL: https://issues.apache.org/jira/browse/SPARK-22166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: spark 2.2
> hadoop 2.6.0
> jdk 1.8
>Reporter: 吴志龙
>
> ${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
> --driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
> --executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
> spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
> -XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""
> java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>  +details
> java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:161)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:378)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:402)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>   at 
> org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
> FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message= +details
> FetchFailed(null, shuffleId=3, mapId=-1, reduceId=24, message=
> org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output 
> location for shuffle 3
>   at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:697)
>   at 
> org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:693)
>   at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
>   at 
> org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:693)
>   at 
> org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:147)
>   at 
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:49)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at 

[jira] [Updated] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-22166:

Description: 
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""

ava.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
/home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
 +details
java.lang.OutOfMemoryError: error while calling spill() on 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
/home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
at 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:161)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:245)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
at 
org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:272)
at 
org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:378)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:402)
at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at 
org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)

  was:
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""


!http://example.com/image.png!


> java.lang.OutOfMemoryError: error while calling spill() 
> 
>
> Key: SPARK-22166
> URL: https://issues.apache.org/jira/browse/SPARK-22166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: spark 2.2
> hadoop 2.6.0
> jdk 1.8
>Reporter: 吴志龙
>
> ${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
> --driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
> --executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
> spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
> -XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""
> ava.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>  +details
> java.lang.OutOfMemoryError: error while calling spill() on 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter@1b813200 : 
> /home/fqlhadoop/datas/hadoop/tmp-hadoop-biadmin/nm-local-dir/usercache/biadmin/appcache/application_1504095691482_250304/blockmgr-3347e81a-150c-4dee-94a7-727494bf4fe4/0c/temp_local_08a15a87-0d7b-4055-bae7-cc511e48dbd8
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:161)
>   at 
> 

[jira] [Updated] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-22166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

吴志龙 updated SPARK-22166:

Description: 
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""


!http://example.com/image.png!

  was:
${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""



> java.lang.OutOfMemoryError: error while calling spill() 
> 
>
> Key: SPARK-22166
> URL: https://issues.apache.org/jira/browse/SPARK-22166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
> Environment: spark 2.2
> hadoop 2.6.0
> jdk 1.8
>Reporter: 吴志龙
>
> ${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
> --driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
> --executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
> spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
> -XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
> -XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""
> !http://example.com/image.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22166) java.lang.OutOfMemoryError: error while calling spill()

2017-09-29 Thread JIRA
吴志龙 created SPARK-22166:
---

 Summary: java.lang.OutOfMemoryError: error while calling spill() 
 Key: SPARK-22166
 URL: https://issues.apache.org/jira/browse/SPARK-22166
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
 Environment: spark 2.2
hadoop 2.6.0
jdk 1.8
Reporter: 吴志龙


${SPARK_HOME}/bin/spark-sql --master=yarn --queue lx_etl --driver-memory 4g 
--driver-java-options -XX:MaxMetaspaceSize=512m --num-executors 12  
--executor-memory 3g  --hiveconf hive.cli.print.header=false --conf 
spark.executor.extraJavaOptions=" -Xmn768m -XX:+UseG1GC 
-XX:MaxMetaspaceSize=512m -XX:MaxGCPauseMillis=400 -XX:G1ReservePercent=30 
-XX:SoftRefLRUPolicyMSPerMB=0 -XX:InitiatingHeapOccupancyPercent=35" -e ""




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185563#comment-16185563
 ] 

Jia-Xuan Liu commented on SPARK-22137:
--

I guess the problem is that we can't manually create a table with a vector type.
[~viirya], what do you think about this?
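
One possible sketch of that observation, assuming the {{df}} built in the reproduce code 
quoted below: instead of creating the Hive table by hand, let Spark create it so the 
VectorUDT metadata is recorded in the catalog ({{table_udt_spark}} is a hypothetical 
table name):

{code:java}
// Let Spark create the table and record the UDT metadata itself,
// rather than inserting into a manually created Hive table.
df.write.format("parquet").mode("overwrite").saveAsTable("table_udt_spark")
{code}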

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> //Create Hive table based on the VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread yzheng616 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185554#comment-16185554
 ] 

yzheng616 commented on SPARK-22137:
---

Yes, in this case the API just does not work with a table that was created 
manually.

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> //Create Hive table based on the VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185548#comment-16185548
 ] 

Jia-Xuan Liu edited comment on SPARK-22137 at 9/29/17 9:16 AM:
---

Umm... it still fails, and it looks like the same error you hit.

{code:java}
scala> val tdf = spark.table("test")
tdf: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> tdf.write.insertInto("table_udt")
org.apache.spark.sql.AnalysisException: cannot resolve 'test.`features`' due to 
data type mismatch: cannot cast org.apache.spark.ml.linalg.
VectorUDT@3bfc3ba7 to StructType(StructField(type,ByteType,true), 
StructField(size,IntegerType,true), StructField(indices,ArrayType(Integer
Type,true),true), StructField(values,ArrayType(DoubleType,true),true));;
'InsertIntoHadoopFsRelationCommand 
file:/home/xxx/git/spark/spark-warehouse/table_udt, false, Parquet, 
Map(mergeschema -> false), Append, C
atalogTable(
Database: default
Table: table_udt
Owner: xxx
Created Time: Fri Sep 29 17:07:52 CST 2017
Last Access: Thu Jan 01 08:00:00 CST 1970
Created By: Spark 2.3.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1506676072]
Location: file:/home/xxx/git/spark/spark-warehouse/table_udt
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: integer (nullable = true)
 |-- features: struct (nullable = true)
 ||-- type: byte (nullable = true)
 ||-- size: integer (nullable = true)
 ||-- indices: array (nullable = true)
 |||-- element: integer (containsNull = true)
 ||-- values: array (nullable = true)
 |||-- element: double (containsNull = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@c1940c95
+- 'Project [cast(id#37L as int) AS id#66, cast(features#38 as 
struct) AS fe
atures#67]
   +- SubqueryAlias test
  +- Relation[id#37L,features#38] parquet

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:95)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:87)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:100)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:110)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(Que
ryPlan.scala:114)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:114)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:119)
  at 

[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185548#comment-16185548
 ] 

Jia-Xuan Liu edited comment on SPARK-22137 at 9/29/17 9:14 AM:
---

Umm... it still fails.

{code:java}
scala> val tdf = spark.table("test")
tdf: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> tdf.write.insertInto("table_udt")
org.apache.spark.sql.AnalysisException: cannot resolve 'test.`features`' due to 
data type mismatch: cannot cast org.apache.spark.ml.linalg.
VectorUDT@3bfc3ba7 to StructType(StructField(type,ByteType,true), 
StructField(size,IntegerType,true), StructField(indices,ArrayType(Integer
Type,true),true), StructField(values,ArrayType(DoubleType,true),true));;
'InsertIntoHadoopFsRelationCommand 
file:/home/xxx/git/spark/spark-warehouse/table_udt, false, Parquet, 
Map(mergeschema -> false), Append, C
atalogTable(
Database: default
Table: table_udt
Owner: xxx
Created Time: Fri Sep 29 17:07:52 CST 2017
Last Access: Thu Jan 01 08:00:00 CST 1970
Created By: Spark 2.3.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1506676072]
Location: file:/home/xxx/git/spark/spark-warehouse/table_udt
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: integer (nullable = true)
 |-- features: struct (nullable = true)
 ||-- type: byte (nullable = true)
 ||-- size: integer (nullable = true)
 ||-- indices: array (nullable = true)
 |||-- element: integer (containsNull = true)
 ||-- values: array (nullable = true)
 |||-- element: double (containsNull = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@c1940c95
+- 'Project [cast(id#37L as int) AS id#66, cast(features#38 as 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS features#67]
   +- SubqueryAlias test
  +- Relation[id#37L,features#38] parquet

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:95)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:87)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:100)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:110)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(Que
ryPlan.scala:114)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:114)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:89)
  at 

[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185548#comment-16185548
 ] 

Jia-Xuan Liu edited comment on SPARK-22137 at 9/29/17 9:13 AM:
---

Umm... it still fails.

{code:java}
scala> val tdf = spark.table("test")
tdf: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> tdf.write.insertInto("table_udt")
org.apache.spark.sql.AnalysisException: cannot resolve 'test.`features`' due to 
data type mismatch: cannot cast org.apache.spark.ml.linalg.
VectorUDT@3bfc3ba7 to StructType(StructField(type,ByteType,true), 
StructField(size,IntegerType,true), StructField(indices,ArrayType(Integer
Type,true),true), StructField(values,ArrayType(DoubleType,true),true));;
'InsertIntoHadoopFsRelationCommand 
file:/home/jax/git/spark/spark-warehouse/table_udt, false, Parquet, 
Map(mergeschema -> false), Append, C
atalogTable(
Database: default
Table: table_udt
Owner: jax
Created Time: Fri Sep 29 17:07:52 CST 2017
Last Access: Thu Jan 01 08:00:00 CST 1970
Created By: Spark 2.3.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1506676072]
Location: file:/home/xxx/git/spark/spark-warehouse/table_udt
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: integer (nullable = true)
 |-- features: struct (nullable = true)
 ||-- type: byte (nullable = true)
 ||-- size: integer (nullable = true)
 ||-- indices: array (nullable = true)
 |||-- element: integer (containsNull = true)
 ||-- values: array (nullable = true)
 |||-- element: double (containsNull = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@c1940c95
+- 'Project [cast(id#37L as int) AS id#66, cast(features#38 as 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS features#67]
   +- SubqueryAlias test
  +- Relation[id#37L,features#38] parquet

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:95)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:87)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:100)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:110)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(Que
ryPlan.scala:114)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:114)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:89)
  at 

[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185548#comment-16185548
 ] 

Jia-Xuan Liu commented on SPARK-22137:
--

Umm... it still fails.

{code:java}
scala> val tdf = spark.table("test")
tdf: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> tdf.write.insertInto("table_udt")
org.apache.spark.sql.AnalysisException: cannot resolve 'test.`features`' due to 
data type mismatch: cannot cast org.apache.spark.ml.linalg.
VectorUDT@3bfc3ba7 to StructType(StructField(type,ByteType,true), 
StructField(size,IntegerType,true), StructField(indices,ArrayType(Integer
Type,true),true), StructField(values,ArrayType(DoubleType,true),true));;
'InsertIntoHadoopFsRelationCommand 
file:/home/jax/git/spark/spark-warehouse/table_udt, false, Parquet, 
Map(mergeschema -> false), Append, C
atalogTable(
Database: default
Table: table_udt
Owner: jax
Created Time: Fri Sep 29 17:07:52 CST 2017
Last Access: Thu Jan 01 08:00:00 CST 1970
Created By: Spark 2.3.0-SNAPSHOT
Type: MANAGED
Provider: hive
Table Properties: [transient_lastDdlTime=1506676072]
Location: file:/home/jax/git/spark/spark-warehouse/table_udt
Serde Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties: [serialization.format=1]
Partition Provider: Catalog
Schema: root
 |-- id: integer (nullable = true)
 |-- features: struct (nullable = true)
 ||-- type: byte (nullable = true)
 ||-- size: integer (nullable = true)
 ||-- indices: array (nullable = true)
 |||-- element: integer (containsNull = true)
 ||-- values: array (nullable = true)
 |||-- element: double (containsNull = true)
), org.apache.spark.sql.execution.datasources.InMemoryFileIndex@c1940c95
+- 'Project [cast(id#37L as int) AS id#66, cast(features#38 as 
struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS features#67]
   +- SubqueryAlias test
  +- Relation[id#37L,features#38] parquet

  at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:95)
  at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:87)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:286)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:89)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:100)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:110)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(Que
ryPlan.scala:114)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at scala.collection.immutable.List.map(List.scala:285)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:114)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:119)
  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:89)
  at 

[jira] [Comment Edited] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread yzheng616 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185535#comment-16185535
 ] 

yzheng616 edited comment on SPARK-22137 at 9/29/17 9:05 AM:


Have you tried using the DataFrameWriter.insertInto(tableName: String) API to 
insert data into the table table_udt? 


was (Author: yzheng616):
Have you tried using the DataFrameWriter.insertInto(tableName: String) API to 
insert data into the table? 

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> // Create Hive table based on VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}
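
One possible workaround, sketched here only as an illustration (the struct field layout is an assumption taken from the VectorUDT sqlType shown in the reproduction code, and the helper names are hypothetical): explicitly convert the vector column into the declared struct<type,size,indices,values> layout before calling insertInto.

{code:java}
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}
import org.apache.spark.sql.functions.{col, udf}

// Mirrors VectorUDT's sqlType: type (0 = sparse, 1 = dense), size, indices, values.
case class VectorAsStruct(`type`: Byte, size: Option[Int],
                          indices: Option[Seq[Int]], values: Seq[Double])

val vecToStruct = udf { v: Vector =>
  v match {
    case s: SparseVector => VectorAsStruct(0.toByte, Some(s.size), Some(s.indices.toSeq), s.values.toSeq)
    case d: DenseVector  => VectorAsStruct(1.toByte, None, None, d.values.toSeq)
  }
}

// df and tb are the DataFrame and table name from the reproduction code above.
df.withColumn("features", vecToStruct(col("features"))).write.insertInto(tb)
{code}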



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread yzheng616 (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185535#comment-16185535
 ] 

yzheng616 commented on SPARK-22137:
---

Have you tried using the DataFrameWriter.insertInto(tableName: String) API to 
insert data into the table? 

> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>) AS 
> features#28]
>+- LogicalRDD [id#13L, features#14]{color}
> Reproduce code:
> {code:java}
> import scala.annotation.varargs
> import org.apache.spark.ml.linalg.SQLDataTypes
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.types.LongType
> import org.apache.spark.sql.types.StructField
> import org.apache.spark.sql.types.StructType
> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
> object UDTTest {
>   def main(args: Array[String]): Unit = {
> val tb = "table_udt"
> val spark = 
> SparkSession.builder().master("local[4]").appName("UDTInsertInto").enableHiveSupport().getOrCreate()
> spark.sql("drop table if exists " + tb)
> 
> /*
>  * VectorUDT sql type definition:
>  * 
>  *   override def sqlType: StructType = {
>  *   StructType(Seq(
>  *StructField("type", ByteType, nullable = false),
>  *StructField("size", IntegerType, nullable = true),
>  *StructField("indices", ArrayType(IntegerType, containsNull = 
> false), nullable = true),
>  *StructField("values", ArrayType(DoubleType, containsNull = 
> false), nullable = true)))
>  *   }
> */
> 
> // Create Hive table based on VectorUDT sql type
> spark.sql("create table if not exists "+tb+"(id int, features 
> struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)" +
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
> for (x <- 1 to 2) {
>   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 0.21, 
> 0.44)))
> }
> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
> Row.fromSeq(Seq(x.id,x.features)) }
> val schema = StructType(Array(StructField("id", 
> LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
> val df = spark.createDataFrame(rowRDD, schema)
>  
> //insert into hive table
> df.write.insertInto(tb)
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22137) Failed to insert VectorUDT to hive table with DataFrameWriter.insertInto(tableName: String)

2017-09-29 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16185527#comment-16185527
 ] 

Jia-Xuan Liu commented on SPARK-22137:
--

I just did some testing, and I'm not really sure where the problem is.

{code:java}
scala> case class UDT(`id`: Long, `features`: org.apache.spark.ml.linalg.Vector)
defined class UDT

scala> spark.sql("create table if not exists table_udt " +
 | "(id int, features 
struct)" +
 | " row format serde 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
 | " stored as"+
 | " inputformat 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
 | " outputformat 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")

scala> spark.sql("describe table_udt").show()
+--------+--------------------+-------+
|col_name|           data_type|comment|
+--------+--------------------+-------+
|      id|                 int|   null|
|features|struct<type:tinyi...|   null|
+--------+--------------------+-------+
{code}

If we create a Scala DataFrame storing an ml Vector and save it as a Hive table, it will be stored as a vector, as shown below.

{code:java}

scala> var seq = new scala.collection.mutable.ArrayBuffer[UDT]()
seq: scala.collection.mutable.ArrayBuffer[UDT] = ArrayBuffer()

scala> for (x <- 1 to 2) {
 |   seq += (new UDT(x, org.apache.spark.ml.linalg.Vectors.dense(0.2, 
0.21, 0.44)))
 | }

scala> val rowRDD = (spark.sparkContext.makeRDD[UDT](seq)).map { x => 
Row.fromSeq(Seq(x.id,x.features)) }
rowRDD: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
MapPartitionsRDD[15] at map at :36

scala> val schema = StructType(Array(StructField("id", 
LongType,false),StructField("features", SQLDataTypes.VectorType,false)))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(id,LongType,false), 
StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,false))

scala> val df = spark.createDataFrame(rowRDD, schema)
df: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> df.write.saveAsTable("test")

scala> spark.sql("select * from test").show()
+---+---------------+
| id|       features|
+---+---------------+
|  1|[0.2,0.21,0.44]|
|  2|[0.2,0.21,0.44]|
+---+---------------+

scala> spark.sql("describe test").show()
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|      id|   bigint|   null|
|features|   vector|   null|
+--------+---------+-------+

{code}
 
and this table can be written back into itself with insertInto.

{code:java}

scala> val tdf = spark.table("test")
tdf: org.apache.spark.sql.DataFrame = [id: bigint, features: vector]

scala> tdf.write.insertInto("test")

scala> tdf.show()
+---+---------------+
| id|       features|
+---+---------------+
|  1|[0.2,0.21,0.44]|
|  2|[0.2,0.21,0.44]|
|  1|[0.2,0.21,0.44]|
|  2|[0.2,0.21,0.44]|
+---+---------------+

{code}

I also tried to create a table with the vector type directly, but it fails.
Maybe vector isn't a publicly supported type.

{code:java}

scala> spark.sql("create table if not exists table_udt " +
 | "(id int, features vector)" +
 | " row format serde 
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
 | " stored as"+
 | " inputformat 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
 | " outputformat 
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
org.apache.spark.sql.catalyst.parser.ParseException:
DataType vector is not supported.(line 1, pos 54)

{code}
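
A small follow-up sketch (assuming the catalog behaviour observed above) that makes the difference visible: the table written with saveAsTable keeps the VectorUDT in its Spark schema, while the Hive-DDL table only exposes the plain struct, so insertInto has to cast vector to struct and fails.

{code:java}
// Sketch only: compare the catalog schemas of the two tables used above.
scala> spark.table("test").schema("features").dataType
// expected: the ML VectorUDT (what describe shows as "vector")

scala> spark.table("table_udt").schema("features").dataType
// expected: StructType(StructField(type,ByteType,true), StructField(size,IntegerType,true), ...)
{code}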


> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String)
> ---
>
> Key: SPARK-22137
> URL: https://issues.apache.org/jira/browse/SPARK-22137
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: yzheng616
>
> Failed to insert VectorUDT to hive table with 
> DataFrameWriter.insertInto(tableName: String). The issue seems similar to 
> SPARK-17765, which was resolved in 2.1.0. 
> Error message: 
> {color:red}Exception in thread "main" org.apache.spark.sql.AnalysisException: 
> cannot resolve '`features`' due to data type mismatch: cannot cast 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 to 
> StructType(StructField(type,ByteType,true), 
> StructField(size,IntegerType,true), 
> StructField(indices,ArrayType(IntegerType,true),true), 
> StructField(values,ArrayType(DoubleType,true),true));;
> 'InsertIntoTable Relation[id#21,features#22] parquet, 
> OverwriteOptions(false,Map()), false
> +- 'Project [cast(id#13L as int) AS id#27, cast(features#14 as 
> 

[jira] [Updated] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21893:
--
Labels: releasenotes  (was: )

> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.
> Update: it'll also require removing 0.8.x examples, because otherwise the 
> example module has to be split.
> While not necessarily connected, it's probably a decent point to declare 0.8 
> deprecated. And that means declaring 0.10 (the other API left) as stable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22142) Move Flume support behind a profile

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22142.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19365
[https://github.com/apache/spark/pull/19365]

> Move Flume support behind a profile
> ---
>
> Key: SPARK-22142
> URL: https://issues.apache.org/jira/browse/SPARK-22142
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> Kafka 0.8 support was recently put behind a profile. YARN, Mesos, Kinesis, 
> Docker-related integrations are behind profiles. Flume support seems like it 
> could be as well, to make it opt-in for builds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22142) Move Flume support behind a profile

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-22142:
--
Labels: releasenotes  (was: )

> Move Flume support behind a profile
> ---
>
> Key: SPARK-22142
> URL: https://issues.apache.org/jira/browse/SPARK-22142
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
>  Labels: releasenotes
> Fix For: 2.3.0
>
>
> Kafka 0.8 support was recently put behind a profile. YARN, Mesos, Kinesis, 
> Docker-related integrations are behind profiles. Flume support seems like it 
> could be as well, to make it opt-in for builds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22157) The uniux_timestamp method handles the time field that is lost in mill

2017-09-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-22157.
---
Resolution: Not A Problem

> The uniux_timestamp method handles the time field that is lost in mill
> --
>
> Key: SPARK-22157
> URL: https://issues.apache.org/jira/browse/SPARK-22157
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: hantiantian
>
> 1. Create table test and execute the following command:
> select s1 from test;
> result:  2014-10-10 19:30:10.222
> 2. Using the native unix_timestamp method, execute the following command:
> select unix_timestamp(s1,"yyyy-MM-dd HH:mm:ss.SSS") from test;
> result:  1412940610
> Obviously, the millisecond part of the time field has been lost.
> 3. After the fix, execute the command again:
> select unix_timestamp(s1,"yyyy-MM-dd HH:mm:ss.SSS") from test;
> result:  1412940610.222
> Conclusion: After the fix, we can keep the millisecond part of the time field.
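
For reference, a minimal sketch (not from the reporter's patch, assuming Spark 2.2+ and a string column s1 in the format above) of keeping the millisecond part without changing unix_timestamp: cast the parsed timestamp to double, which yields seconds with the fractional part.

{code:java}
// Sketch only: to_timestamp parses the string, and casting a timestamp to double
// returns seconds since the epoch including the fractional (millisecond) part.
spark.sql("select cast(to_timestamp(s1, 'yyyy-MM-dd HH:mm:ss.SSS') as double) from test").show(false)
// For "2014-10-10 19:30:10.222" this keeps the .222 instead of truncating it.
{code}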



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22146:

Fix Version/s: 2.3.0

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
> Fix For: 2.3.0
>
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.
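
As an aside, a small sketch (not the actual OrcFileOperator code path) of where the doubled escaping in the reported path can come from: a directory name that already contains a literal "%3A" gets percent-encoded a second time when it is turned back into a URI.

{code:java}
// Sketch only: java.net.URI's multi-argument constructors always quote '%' and spaces,
// so "folder %3Aa" becomes "folder%20%253Aa", exactly the path in the exception above.
scala> new java.net.URI(null, null, "/tmp/orc_test/folder %3Aa/orc1.orc", null).getRawPath
// expected: /tmp/orc_test/folder%20%253Aa/orc1.orc
{code}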



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22146) FileNotFoundException while reading ORC files containing '%'

2017-09-29 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-22146.
-
Resolution: Fixed
  Assignee: Marco Gaido

> FileNotFoundException while reading ORC files containing '%'
> 
>
> Key: SPARK-22146
> URL: https://issues.apache.org/jira/browse/SPARK-22146
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> Reading ORC files containing "strange" characters like '%' fails with a 
> FileNotFoundException.
> For instance, if you have:
> {noformat}
> /tmp/orc_test/folder %3Aa/orc1.orc
> /tmp/orc_test/folder %3Ab/orc2.orc
> {noformat}
> and you try to read the ORC files with:
> {noformat}
> spark.read.format("orc").load("/tmp/orc_test/*/*").show
> {noformat}
> you will get a:
> {noformat}
> java.io.FileNotFoundException: File 
> file:/tmp/orc_test/folder%20%253Aa/orc1.orc does not exist
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.listLeafStatuses(SparkHadoopUtil.scala:194)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.listOrcFiles(OrcFileOperator.scala:94)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.getFileReader(OrcFileOperator.scala:67)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$$anonfun$readSchema$1.apply(OrcFileOperator.scala:77)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileOperator$.readSchema(OrcFileOperator.scala:77)
>   at 
> org.apache.spark.sql.hive.orc.OrcFileFormat.inferSchema(OrcFileFormat.scala:60)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:197)
>   at scala.Option.orElse(Option.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:196)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:387)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:168)
>   ... 48 elided
> {noformat}
> Note that the same code works for Parquet and text files.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org