[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-08-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124731#comment-16124731
 ] 

Marcelo Vanzin commented on SPARK-18085:


[~duyanghao] that should all be explained in the document attached to this bug. 
I encourage you to read it if you're looking for details, or take a look at the 
work-in-progress code linked in many comments above. You're also welcome to run 
the code against your event logs and report any problems.

Note that no part of this work is about speeding up the loading of logs; 
loading an event log from scratch will most probably become slower, since the 
data is now also written to disk. The goal here is to control the memory usage 
of the SHS, and to only have to process each event log once.
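
For context, a sketch of what this enables on the configuration side. 
spark.history.fs.logDirectory is the existing log-location setting; the name of 
the disk-store property is an assumption based on the work-in-progress code, 
not a released setting:

{noformat}
# spark-defaults.conf for the History Server
spark.history.fs.logDirectory   hdfs:///spark-logs
# Hypothetical local directory where parsed application data is kept on disk,
# so each event log only has to be processed once:
spark.history.store.path        /var/lib/spark/history-store
{noformat}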

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly that describes the issues and suggests 
> a path toward solving them.






[jira] [Assigned] (SPARK-21709) use sbt 0.13.16 and update sbt plugins

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21709:
-

Assignee: PJ Fanning

> use sbt 0.13.16 and update sbt plugins
> --
>
> Key: SPARK-21709
> URL: https://issues.apache.org/jira/browse/SPARK-21709
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Assignee: PJ Fanning
>Priority: Minor
> Fix For: 2.3.0
>
>
> A preliminary step to SPARK-21708.
> Quite a lot of sbt plugin changes are needed to get to full sbt 1.0.0 support.






[jira] [Resolved] (SPARK-21709) use sbt 0.13.16 and update sbt plugins

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21709.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18921
[https://github.com/apache/spark/pull/18921]

> use sbt 0.13.16 and update sbt plugins
> --
>
> Key: SPARK-21709
> URL: https://issues.apache.org/jira/browse/SPARK-21709
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.3.0
>Reporter: PJ Fanning
>Priority: Minor
> Fix For: 2.3.0
>
>
> A preliminary step to SPARK-21708.
> Quite a lot of sbt plugin changes are needed to get to full sbt 1.0.0 support.






[jira] [Updated] (SPARK-21716) The time-range window can't be applied on the reduce operator

2017-08-12 Thread Fan Donglai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Donglai updated SPARK-21716:

Summary:  The time-range window can't be applied on the reduce operator  
(was:  The time-range window can't be applid on the reduce operator)

>  The time-range window can't be applied on the reduce operator
> --
>
> Key: SPARK-21716
> URL: https://issues.apache.org/jira/browse/SPARK-21716
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Fan Donglai
>
> I can't use the GroupBy + Window operators to get the newest (maximum event 
> time) row in a window. The window should also be applicable to the reduce 
> operator.






[jira] [Updated] (SPARK-21716) The time-range window can't be applied on the reduce operator

2017-08-12 Thread Fan Donglai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fan Donglai updated SPARK-21716:

Description: I can't use the GroupBy + Window operators to get the newest 
(maximum event time) row in a window. The window should also be applicable to 
the reduce operator.  (was: I can't use GroupBy + Window operator to get the 
newest(the maximum event time) row in a window.So pls make the window can be 
applid on the reduce operator)

>  The time-range window can't be applied on the reduce operator
> -
>
> Key: SPARK-21716
> URL: https://issues.apache.org/jira/browse/SPARK-21716
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Fan Donglai
>
> I can't use the GroupBy + Window operators to get the newest (maximum event 
> time) row in a window. The window should also be applicable to the reduce 
> operator.






[jira] [Created] (SPARK-21716) The time-range window can't be applied on the reduce operator

2017-08-12 Thread Fan Donglai (JIRA)
Fan Donglai created SPARK-21716:
---

 Summary:  The time-range window can't be applied on the reduce operator
 Key: SPARK-21716
 URL: https://issues.apache.org/jira/browse/SPARK-21716
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Fan Donglai


I can't use the GroupBy + Window operators to get the newest (maximum event time) 
row in a window. Please make the window applicable to the reduce operator.
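
For illustration, a minimal sketch of the two shapes involved, assuming a 
Dataset of events read from a hypothetical source (the schema, path, and column 
names are assumptions):

{code:scala}
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

case class Event(key: String, eventTime: Timestamp, value: Long)

val spark = SparkSession.builder.master("local").getOrCreate()
import spark.implicits._

// Hypothetical source of events:
val events = spark.read.parquet("/path/to/events").as[Event]

// A time-range window works with groupBy aggregations:
events.groupBy(window($"eventTime", "10 minutes"), $"key").count()

// But the window cannot be attached to a reduce-style operator, e.g. to keep
// only the row with the maximum event time per window:
events
  .groupByKey(_.key)  // no window(...) can be applied here
  .reduceGroups((a, b) => if (a.eventTime.after(b.eventTime)) a else b)
{code}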






[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2017-08-12 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123803#comment-16123803
 ] 

Valeriy Avanesov edited comment on SPARK-5564 at 8/12/17 10:29 AM:
---

I am considering working on this issue. The question is whether there should be 
a separate EMLDAOptimizerVorontsov or whether the existing EMLDAOptimizer should 
be rewritten.

[~josephkb], what are your thoughts?


was (Author: acopich):
I am considering working on this issue. The question is whether there should be 
another EMLDAOptimizerVorontsov or shall the existing EMLDAOptimizer be 
re-written.



> Support sparse LDA solutions
> 
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
> concentration parameters be > 1.0.  It should support values > 0.0, which 
> should encourage sparser topics (phi) and document-topic distributions 
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov 
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive 
> Regularization for Stochastic Matrix Factorization." 2014.
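
For reference, a sketch of the projected M-step this points to, with n_wt and 
n_td the expected topic-word and document-topic counts and beta, alpha the 
concentration parameters (the form follows the cited line of work, not existing 
Spark code):

{noformat}
\phi_{wt}   \propto \max(0,\, n_{wt} + \beta  - 1)
\theta_{td} \propto \max(0,\, n_{td} + \alpha - 1)
{noformat}

With concentration parameters below 1.0, the max(0, .) projection zeroes out 
small counts, which is what yields sparse phi and theta.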






[jira] [Commented] (SPARK-21691) Accessing canonicalized plan for query with limit throws exception

2017-08-12 Thread Anton Okolnychyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124532#comment-16124532
 ] 

Anton Okolnychyi commented on SPARK-21691:
--

The issue is related to `Project \[\*\]`, not to `Limit`. You will always get 
this exception when there is a non-top-level `Project \[\*\]` in the logical 
plan. For instance, the following query produces the same exception:

{noformat}
spark.sql("select * from (select * from (values 0, 1)) as v")
  .queryExecution.logical.canonicalized
{noformat}

In the failing example from the ticket description, the non-canonicalized 
logical plan looks like:

{noformat}
'GlobalLimit 1
+- 'LocalLimit 1
   +- 'Project [*]
      +- 'SubqueryAlias __auto_generated_subquery_name
         +- 'UnresolvedInlineTable [col1], [List(0), List(1)]
{noformat}

When Spark tries to canonicalize it and processes `LocalLimit 1`, it gets all 
child attributes by calling `children.flatMap(_.output)`, which triggers the 
problem: `Project#output` tries to convert its project list to attributes, and 
this fails for `UnresolvedStar` with the aforementioned exception.

`UnresolvedRelation` and `UnresolvedInlineTable` return `Nil` as output. 
Therefore, one option to fix this problem is to also return `Nil` as output 
from `Project` while it is unresolved.

{noformat}
override def output: Seq[Attribute] =
  if (resolved) projectList.map(_.toAttribute) else Nil
{noformat}

I can fix it once we agree on a solution.


> Accessing canonicalized plan for query with limit throws exception
> --
>
> Key: SPARK-21691
> URL: https://issues.apache.org/jira/browse/SPARK-21691
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bjoern Toldbod
>
> Accessing the logical, canonicalized plan fails for queries with limits.
> The following demonstrates the issue:
> {code:java}
> val session = SparkSession.builder.master("local").getOrCreate()
> // This works
> session.sql("select * from (values 0, 1)")
>   .queryExecution.logical.canonicalized
> // This fails
> session.sql("select * from (values 0, 1) limit 1")
>   .queryExecution.logical.canonicalized
> {code}
> The message in the thrown exception is somewhat confusing (or at least not 
> directly related to the limit):
> "Invalid call to toAttribute on unresolved object, tree: *"






[jira] [Commented] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-08-12 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124531#comment-16124531
 ] 

Valeriy Avanesov commented on SPARK-14371:
--

Hi,

I opened a PR for this JIRA yesterday:
https://github.com/apache/spark/pull/18924

However, something seems to be wrong: the JIRA is still not "In Progress" and 
the PR is not linked to it. Could anyone please check what's wrong? 

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be 
> done by someone knowledgeable about online LDA and Spark.)
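
A minimal sketch of the distributed pattern the description asks for, assuming 
stats is an RDD whose first element per row is a per-partition gamma matrix of 
size k x vocabSize (the names stats, k, and vocabSize are assumptions for 
illustration):

{code:scala}
import breeze.linalg.{DenseMatrix => BDM}

// Instead of stats.collect() followed by driver-side summation, aggregate
// the per-partition sufficient statistics on the executors:
val statsSum: BDM[Double] = stats
  .map { case (gammaPart, _) => gammaPart }
  .treeAggregate(BDM.zeros[Double](k, vocabSize))(
    seqOp = (acc, m) => acc += m,  // accumulate within a partition
    combOp = (a, b) => a += b)     // merge partition results
{code}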






[jira] [Commented] (SPARK-21711) spark-submit command should accept log4j configuration parameters for spark client logging.

2017-08-12 Thread Mahesh Ambule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124529#comment-16124529
 ] 

Mahesh Ambule commented on SPARK-21711:
---

Sean Owen: I figured out a way to set the log4j configuration for the Spark 
client/launcher. The $SPARK_SUBMIT_OPTS environment variable can be set to 
include the log4j configuration; $SPARK_SUBMIT_OPTS is appended to the 
command-line options when "org.apache.spark.deploy.SparkSubmit" is invoked.

export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dlog4j.configuration=/home/mahesh/log4j.config"

Link for the relevant code:
https://github.com/apache/spark/blob/branch-2.1/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L242
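
For end-to-end illustration (the application class, jar, and paths are 
hypothetical; the driver option is shown only to contrast client-side and 
driver-side logging):

{noformat}
# Client/launcher JVM logging, picked up by SparkSubmitCommandBuilder:
export SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dlog4j.configuration=file:/home/mahesh/log4j.config"

# Driver logging is configured separately, e.g.:
spark-submit \
  --master yarn \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/path/to/driver-log4j.properties" \
  --class org.example.MyApp \
  myapp.jar
{noformat}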

> spark-submit command should accept log4j configuration parameters for spark 
> client logging.
> ---
>
> Key: SPARK-21711
> URL: https://issues.apache.org/jira/browse/SPARK-21711
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Mahesh Ambule
>Priority: Minor
> Attachments: spark-submit client logs.txt
>
>
> Currently, log4j properties can be specified in the log4j.properties file in 
> Spark's 'conf' directory.
> The spark-submit command can override these log4j properties for the driver 
> and executors, but it cannot override them for the *spark client* 
> application.
> The user should be able to pass log4j properties for the spark client using 
> the spark-submit command.






[jira] [Updated] (SPARK-21655) Kill CLI for Yarn mode

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21655:
--
Target Version/s:   (was: 2.1.1)
   Fix Version/s: (was: 2.1.1)

> Kill CLI for Yarn mode
> --
>
> Key: SPARK-21655
> URL: https://issues.apache.org/jira/browse/SPARK-21655
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.1.1
>Reporter: Jong Yoon Lee
>Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Similar to how standalone and Mesos have the capability to safely shut down 
> the Spark application, there should be a way to safely shut down Spark in 
> YARN mode. This will ensure a clean shutdown and unregistration from YARN.
> This is the design doc:
> https://docs.google.com/document/d/1QG8hITjLNi1D9dVR3b_hZkyrGm5FFm0u9M1KGM4y1Ak/edit?usp=sharing
> and I will upload the patch soon






[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21715:
--
Affects Version/s: (was: 2.3.0)
 Target Version/s:   (was: 2.3.0)
Fix Version/s: (was: 2.3.0)

> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Priority: Minor
> Attachments: Performance.png, ResponseContent.png
>
>
> The UI looks fine on the home page, but when we checked the performance of 
> each individual component, we found three image requests that take much 
> longer than expected: favicon.ico, sort_both.png, and sort_desc.png. 
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, and 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks a table 
> header to sort a column, another request for 
> http://hostname:port/images/sort_asc.png is sent.
> Browsers request favicon.ico by default, and the three sort_xxx.png requests 
> are default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to serve http requests, but 
> none of these requests is handled correctly: they all cause the history 
> server to respond with the history page html content. As the screenshots 
> show, the response data type is "text/html" in every case.
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images
> and put the folder under "core/src/main/resources/org/apache/spark/ui/static/". 
> We also need to modify dataTables.bootstrap.css to point at the correct image 
> locations. For the favicon.ico request, we need to add one line to the html 
> header to disable the download. 
> I can post a pull request if this is the correct way to fix this. I have 
> tried it and it works fine.
> !https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!
> !https://issues.apache.org/jira/secure/attachment/12881535/ResponseContent.png!
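
A sketch of the one-line header change mentioned above; an empty data URI is a 
common idiom for suppressing the favicon request, though whether the eventual 
patch uses exactly this is an assumption:

{noformat}
<link rel="icon" href="data:,">
{noformat}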






[jira] [Updated] (SPARK-21694) Support Mesos CNI network labels

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21694:
--
 Priority: Minor  (was: Major)
Fix Version/s: (was: 2.3.0)

> Support Mesos CNI network labels
> 
>
> Key: SPARK-21694
> URL: https://issues.apache.org/jira/browse/SPARK-21694
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.2.0
>Reporter: Susan X. Huynh
>Priority: Minor
>
> Background: SPARK-18232 added the ability to launch containers attached to a 
> CNI network by specifying the network name via `spark.mesos.network.name`.
> This ticket is to allow the user to pass network labels to CNI plugins. More 
> details are in the related Mesos documentation: 
> http://mesos.apache.org/documentation/latest/cni/#mesos-meta-data-to-cni-plugins
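
For illustration, a sketch of how such labels might be passed alongside the 
existing network name; the property name and key:value format are assumptions, 
not a committed API:

{noformat}
--conf spark.mesos.network.name=mynet \
--conf spark.mesos.network.labels=key1:val1,key2:val2
{noformat}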






[jira] [Updated] (SPARK-21715) History Server responds with history page html content multiple times for only one http request

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21715:
--
Issue Type: Improvement  (was: Bug)

> History Server responds with history page html content multiple times for 
> only one http request
> ---
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Priority: Minor
> Attachments: Performance.png, ResponseContent.png
>
>
> The UI looks fine on the home page, but when we checked the performance of 
> each individual component, we found three image requests that take much 
> longer than expected: favicon.ico, sort_both.png, and sort_desc.png. 
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, and 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks a table 
> header to sort a column, another request for 
> http://hostname:port/images/sort_asc.png is sent.
> Browsers request favicon.ico by default, and the three sort_xxx.png requests 
> are default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to serve http requests, but 
> none of these requests is handled correctly: they all cause the history 
> server to respond with the history page html content. As the screenshots 
> show, the response data type is "text/html" in every case.
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images
> and put the folder under "core/src/main/resources/org/apache/spark/ui/static/". 
> We also need to modify dataTables.bootstrap.css to point at the correct image 
> locations. For the favicon.ico request, we need to add one line to the html 
> header to disable the download. 
> I can post a pull request if this is the correct way to fix this. I have 
> tried it and it works fine.
> !https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!
> !https://issues.apache.org/jira/secure/attachment/12881535/ResponseContent.png!






[jira] [Updated] (SPARK-21687) Spark SQL should set createTime for Hive partition

2017-08-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21687:
--
Target Version/s:   (was: 2.3.0)
Priority: Minor  (was: Major)
   Fix Version/s: (was: 2.3.0)
  Issue Type: Improvement  (was: Bug)

> Spark SQL should set createTime for Hive partition
> --
>
> Key: SPARK-21687
> URL: https://issues.apache.org/jira/browse/SPARK-21687
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Chaozhong Yang
>Priority: Minor
>
> In Spark SQL, we often use `insert overwrite table t partition(p=xx)` to 
> create partitions for a partitioned table. `createTime` is important 
> information for managing the data lifecycle, e.g. TTL.
> However, we found that Spark SQL doesn't call setCreateTime in 
> `HiveClientImpl#toHivePartition`, shown below:
> {code:scala}
> def toHivePartition(
>     p: CatalogTablePartition,
>     ht: HiveTable): HivePartition = {
>   val tpart = new org.apache.hadoop.hive.metastore.api.Partition
>   val partValues = ht.getPartCols.asScala.map { hc =>
>     p.spec.get(hc.getName).getOrElse {
>       throw new IllegalArgumentException(
>         s"Partition spec is missing a value for column '${hc.getName}': ${p.spec}")
>     }
>   }
>   val storageDesc = new StorageDescriptor
>   val serdeInfo = new SerDeInfo
>   p.storage.locationUri.map(CatalogUtils.URIToString(_)).foreach(storageDesc.setLocation)
>   p.storage.inputFormat.foreach(storageDesc.setInputFormat)
>   p.storage.outputFormat.foreach(storageDesc.setOutputFormat)
>   p.storage.serde.foreach(serdeInfo.setSerializationLib)
>   serdeInfo.setParameters(p.storage.properties.asJava)
>   storageDesc.setSerdeInfo(serdeInfo)
>   tpart.setDbName(ht.getDbName)
>   tpart.setTableName(ht.getTableName)
>   tpart.setValues(partValues.asJava)
>   tpart.setSd(storageDesc)
>   new HivePartition(ht, tpart)
> }
> {code}
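
A minimal sketch of the kind of change this suggests, assuming the catalog 
partition carries a creation timestamp in milliseconds (the field name 
p.createTime is an assumption):

{code:scala}
// Hypothetical addition before constructing the HivePartition:
tpart.setCreateTime((p.createTime / 1000).toInt)  // Hive stores createTime in seconds
{code}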






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-12 Thread Peter Knight (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124496#comment-16124496
 ] 

Peter Knight commented on SPARK-17025:
--


Thank you for your e-mail. I am on holiday until Monday 21st August, when I 
will try to deal with your request.


Pete

Dr Peter Knight
Sr Staff Analytics Engineer| UK Data Science | Digital Services Solutions
GE Aviation

T: +44 (0)23 8024 7237 | WebEx: https://emeetings.webex.com/meet/pr108008065 | Telecon: 4090615#




> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2017-08-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124493#comment-16124493
 ] 

Joseph K. Bradley commented on SPARK-17025:
---

[~nchammas] I just merged https://github.com/apache/spark/pull/1 which 
should make this work if the custom Transformer uses simple (JSON-serializable) 
Params to store all of its data.  Does it meet your use case?  I'd like to make 
it easier to implement ML persistence for fancier data types in Transformers 
and Models (like Vectors or DataFrames) in the future, but hopefully this 
unblocks some use cases for now.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in <module>
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org