[jira] [Created] (SPARK-19605) Fail it if existing resource is not enough to run streaming job

2017-02-14 Thread Genmao Yu (JIRA)
Genmao Yu created SPARK-19605:
-

 Summary: Fail it if existing resource is not enough to run 
streaming job
 Key: SPARK-19605
 URL: https://issues.apache.org/jira/browse/SPARK-19605
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.1.0, 2.0.2
Reporter: Genmao Yu
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists

2017-02-14 Thread Eyal Zituny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867409#comment-15867409
 ] 

Eyal Zituny commented on SPARK-19594:
-

Sure, I can do that. Would it make sense to fix it by marking the query id as 
terminated instead of removing it from the list?

> StreamingQueryListener fails to handle QueryTerminatedEvent if more than one 
> listener exists
> -
>
> Key: SPARK-19594
> URL: https://issues.apache.org/jira/browse/SPARK-19594
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Eyal Zituny
>Priority: Minor
>
> reproduce:
> * create a spark session
> * add multiple streaming query listeners
> * create a simple query
> * stop the query
> result -> only the first listener handles the QueryTerminatedEvent
> this might happen because the query run id is being removed from 
> activeQueryRunIds once onQueryTerminated is called 
> (StreamingQueryListenerBus:115)
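
A minimal sketch of the reproduction described above, assuming a local 
SparkSession; the rate source, console sink, and listener names are only 
illustrative stand-ins, and any streaming source and sink will do:

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

object MultiListenerRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("repro").getOrCreate()

    // Two listeners that only react to termination events.
    def listener(name: String): StreamingQueryListener = new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
        println(s"$name saw termination of ${event.id}")
    }
    spark.streams.addListener(listener("listener-1"))
    spark.streams.addListener(listener("listener-2"))

    // A trivial query that is stopped right away.
    val query = spark.readStream.format("rate").load()
      .writeStream.format("console").start()
    query.processAllAvailable()
    query.stop()

    // Give the listener bus a moment to deliver the termination event.
    Thread.sleep(2000)
    // Per this report, only "listener-1 saw termination ..." is printed.
    spark.stop()
  }
}
{code}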



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-14 Thread Song Jun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867398#comment-15867398
 ] 

Song Jun commented on SPARK-19598:
--

OK~ I'd like to do this. Thank you very much!

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.
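
A rough sketch of the proposed direction; the helper name is hypothetical, and 
the exact Catalyst constructor signatures of UnresolvedRelation and 
SubqueryAlias vary across Spark versions:

{code}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, SubqueryAlias}

// Before: the relation itself carries the alias.
//   UnresolvedRelation(TableIdentifier("t1"), alias = Some("a"))
// After: the alias is a separate SubqueryAlias node wrapping a plain relation,
// so analyzer rules only ever need to match on SubqueryAlias.
def aliased(table: TableIdentifier, alias: Option[String]): LogicalPlan = {
  val relation = UnresolvedRelation(table)
  alias.map(a => SubqueryAlias(a, relation)).getOrElse(relation)
}
{code}

With this shape, an alias is always represented by a SubqueryAlias node, which 
is what lets analyzer code such as the broadcast hint resolution match a single 
plan shape.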



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

2017-02-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867374#comment-15867374
 ] 

Abel Rincón commented on SPARK-16742:
-

Hi all, we are working on a solution with Hadoop delegation tokens and without 
proxy users. I hope you can take a look at the new code today.

> Kerberos support for Spark on Mesos
> ---
>
> Key: SPARK-16742
> URL: https://issues.apache.org/jira/browse/SPARK-16742
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos.  We'll be 
> contributing it to Apache Spark soon.
> Mesosphere design doc: 
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code: 
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Yun Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867366#comment-15867366
 ] 

Yun Ni commented on SPARK-18392:


I agree with Seth. We first need to finish SPARK-18080 and SPARK-18450 before 
implementing new LSH subclasses. (SPARK-18454 could be a prerequisite as well, 
but I am not sure.)

[~merlin] Once the prerequisites are satisfied, you can take either BitSampling 
or SignRandomProjection and I can take the other one.

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
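
To make the renamed API concrete, a small sketch using MinHashLSH with 
numHashTables (the OR-amplification parameter discussed above); the data and 
parameter values are only illustrative:

{code}
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object MinHashLSHSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("minhash-lsh").getOrCreate()

    // Sets encoded as sparse binary vectors: non-zero positions are set members.
    val df = spark.createDataFrame(Seq(
      (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0), (2, 1.0)))),
      (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0), (4, 1.0)))),
      (2, Vectors.sparse(6, Seq((0, 1.0), (2, 1.0), (4, 1.0))))
    )).toDF("id", "features")

    // numHashTables is the OR-amplification knob (the renamed outputDim); each
    // entry of the output column holds one hash table's value.
    val mh = new MinHashLSH()
      .setNumHashTables(3)
      .setInputCol("features")
      .setOutputCol("hashes")

    val model = mh.fit(df)
    model.transform(df).show(truncate = false)

    // Approximate self-join on Jaccard distance below a threshold.
    model.approxSimilarityJoin(df, df, 0.8).show(truncate = false)

    spark.stop()
  }
}
{code}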



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Comment Edited] (SPARK-19442) Unable to add column to the dataset using Dataset.WithColumn() api

2017-02-14 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867357#comment-15867357
 ] 

Navya Krishnappa edited comment on SPARK-19442 at 2/15/17 7:04 AM:
---

Thank you [~hyukjin.kwon]. 

It is working as per my requirement. I could create a new column with blank 
values. :) 


was (Author: navya krishnappa):
Thank you [~hyukjin.kwon]. 

It is satisfied my requirement. I could create a new column with blank values. 
:) 

> Unable to add column to the dataset using Dataset.WithColumn() api
> --
>
> Key: SPARK-19442
> URL: https://issues.apache.org/jira/browse/SPARK-19442
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When I'm creating a new column using the Dataset.withColumn() API, an 
> AnalysisException is thrown.
> Dataset.withColumn() API call: 
> Dataset.withColumn("newColumnName", new 
> org.apache.spark.sql.Column("newColumnName").cast("int"));
> Stacktrace: 
> cannot resolve '`NewColumn`' given input columns: [abc,xyz ]
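
The earlier suggestion from the comments is not quoted here, but one way to get 
a new column with blank values is sketched below, in Scala (the Java API is 
analogous) and with a small illustrative dataset: build the column from a 
literal instead of referencing a column that does not exist yet.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object WithColumnBlankExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("withColumn-blank").getOrCreate()
    import spark.implicits._

    val ds = Seq(("a", 1), ("b", 2)).toDF("abc", "xyz")

    // new Column("newColumnName") refers to a column that does not exist yet,
    // which is why the analyzer fails with "cannot resolve '`newColumnName`'".
    // A literal creates the column instead of referencing it; casting an empty
    // string to int yields nulls, i.e. a "blank" column.
    val withBlank = ds.withColumn("newColumnName", lit("").cast("int"))
    withBlank.printSchema()

    spark.stop()
  }
}
{code}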



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19442) Unable to add column to the dataset using Dataset.WithColumn() api

2017-02-14 Thread Navya Krishnappa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867357#comment-15867357
 ] 

Navya Krishnappa commented on SPARK-19442:
--

Thank you [~hyukjin.kwon]. 

It satisfies my requirement. I could create a new column with blank values. 
:) 

> Unable to add column to the dataset using Dataset.WithColumn() api
> --
>
> Key: SPARK-19442
> URL: https://issues.apache.org/jira/browse/SPARK-19442
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When I'm creating a new column using the Dataset.withColumn() API, an 
> AnalysisException is thrown.
> Dataset.withColumn() API call: 
> Dataset.withColumn("newColumnName", new 
> org.apache.spark.sql.Column("newColumnName").cast("int"));
> Stacktrace: 
> cannot resolve '`NewColumn`' given input columns: [abc,xyz ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19604) Log the start of every Python test

2017-02-14 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-19604:


Assignee: Yin Huai

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>
> Right now, we only have an info-level log after we finish the tests of a Python 
> test file. We should also log the start of a test, so that if a test is hanging 
> we can tell which test file is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19593) Records read per each kinesis transaction

2017-02-14 Thread Sarath Chandra Jiguru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867214#comment-15867214
 ] 

Sarath Chandra Jiguru edited comment on SPARK-19593 at 2/15/17 5:52 AM:


The type of the ticket is Question. In KinesisReceiver.scala, it is currently 
only possible to use the default values of KinesisClientLibConfiguration. Because 
of this, even though the stream is capable of serving the required read rate, the 
Kinesis Spark Streaming consumer is not able to use it.

Don't you think it is really necessary to allow Spark consumers to tweak these 
configurations?


was (Author: sarathjiguru):
See the Type of the ticket is question. In KinesisReceiver.scala, currently it 
is possible to use default values of KinesisClientLibConfiguration. Due to 
this, even the stream is capable of serving the required read rate, the kinesis 
spark streaming consumer is not able to use it.

Don't you think, it is highly needed to allow spark consumers to tweak these 
configurations?

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to spark streaming+kinesis integration
> Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
> records read per transaction? 
> Right now, even though I am eligible to read 2.8G/minute, I am restricted to 
> reading only 100MB/minute, as I am not able to increase the default number of 
> records per transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html
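
For context, the KCL knob the reporter wants to reach looks roughly like the 
sketch below; the application, stream, and worker names are placeholders, the 
maxRecords value is illustrative, and, per the comments above, Spark's 
KinesisReceiver builds its KinesisClientLibConfiguration internally with the 
defaults, so this is not currently reachable through KinesisUtils.createStream.

{code}
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration

// Placeholder names; maxRecords controls how many records one GetRecords call fetches.
val kclConfig = new KinesisClientLibConfiguration(
    "my-spark-app",
    "my-stream",
    new DefaultAWSCredentialsProviderChain(),
    "worker-1")
  .withMaxRecords(10000)
{code}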



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19442) Unable to add column to the dataset using Dataset.WithColumn() api

2017-02-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867277#comment-15867277
 ] 

Hyukjin Kwon commented on SPARK-19442:
--

ping [~Navya Krishnappa], would this satisfy your requirement? 

> Unable to add column to the dataset using Dataset.WithColumn() api
> --
>
> Key: SPARK-19442
> URL: https://issues.apache.org/jira/browse/SPARK-19442
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.2
>Reporter: Navya Krishnappa
>
> When I'm creating a new column using the Dataset.withColumn() API, an 
> AnalysisException is thrown.
> Dataset.withColumn() API call: 
> Dataset.withColumn("newColumnName", new 
> org.apache.spark.sql.Column("newColumnName").cast("int"));
> Stacktrace: 
> cannot resolve '`NewColumn`' given input columns: [abc,xyz ]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19604) Log the start of every Python test

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19604:


Assignee: Apache Spark

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>
> Right now, we only have an info-level log after we finish the tests of a Python 
> test file. We should also log the start of a test, so that if a test is hanging 
> we can tell which test file is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19593) Records read per each kinesis transaction

2017-02-14 Thread Sarath Chandra Jiguru (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867214#comment-15867214
 ] 

Sarath Chandra Jiguru commented on SPARK-19593:
---

See, the type of the ticket is Question. In KinesisReceiver.scala, it is 
currently only possible to use the default values of KinesisClientLibConfiguration. 
Because of this, even though the stream is capable of serving the required read 
rate, the Kinesis Spark Streaming consumer is not able to use it.

Don't you think it is really necessary to allow Spark consumers to tweak these 
configurations?

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to spark streaming+kinesis integration
> Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
> records read per transaction? 
> Right now, even though I am eligible to read 2.8G/minute, I am restricted to 
> reading only 100MB/minute, as I am not able to increase the default number of 
> records per transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19604) Log the start of every Python test

2017-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867212#comment-15867212
 ] 

Apache Spark commented on SPARK-19604:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16935

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>
> Right now, we only have an info-level log after we finish the tests of a Python 
> test file. We should also log the start of a test, so that if a test is hanging 
> we can tell which test file is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19604) Log the start of every Python test

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19604:


Assignee: (was: Apache Spark)

> Log the start of every Python test
> --
>
> Key: SPARK-19604
> URL: https://issues.apache.org/jira/browse/SPARK-19604
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.1.0
>Reporter: Yin Huai
>
> Right now, we only have an info-level log after we finish the tests of a Python 
> test file. We should also log the start of a test, so that if a test is hanging 
> we can tell which test file is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19604) Log the start of every Python test

2017-02-14 Thread Yin Huai (JIRA)
Yin Huai created SPARK-19604:


 Summary: Log the start of every Python test
 Key: SPARK-19604
 URL: https://issues.apache.org/jira/browse/SPARK-19604
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.1.0
Reporter: Yin Huai


Right now, we only have an info-level log after we finish the tests of a Python 
test file. We should also log the start of a test, so that if a test is hanging 
we can tell which test file is running.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction

2017-02-14 Thread Sarath Chandra Jiguru (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Chandra Jiguru updated SPARK-19593:
--
Description: 
The question is related to spark streaming+kinesis integration

Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
records read per transaction? 

Right now, even though I am eligible to read 2.8G/minute, I am restricted to 
reading only 100MB/minute, as I am not able to increase the default number of 
records per transaction.

I have raised a question in stackoverflow as well, please look into it:
http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming

Kinesis stream setup:
open shards: 24
write rate: 440K/minute
read rate: 1.42M/minute
read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
Shards*2 MB*60 Seconds)

Reference: 
http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html


  was:
The question is related to spark streaming+kinesis integration

Is there a way to provide a kinesis consumer configuration. Ex: Number  of 
records read per each transaction etc. 

Right now, even though, I am eligible to read 2.8G/minute, I am restricted to 
read only 100MB/minute, as I am not able to increase the default number of 
records in each transaction.

I have raised a question in stackoverflow as well, please look into it:
http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming

Kinesis stream setup:
open shards: 24
write rate: 440K/minute
read rate: 1.42K/minute
read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
Shards*2 MB*60 Seconds)

Reference: 
http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to spark streaming+kinesis integration
> Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
> records read per transaction? 
> Right now, even though I am eligible to read 2.8G/minute, I am restricted to 
> reading only 100MB/minute, as I am not able to increase the default number of 
> records per transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-02-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867161#comment-15867161
 ] 

Marcelo Vanzin commented on SPARK-19556:


We don't generally assign bugs. Leaving a message should be enough in case 
anyone else was also thinking about working on it.

> Broadcast data is not encrypted when I/O encryption is on
> -
>
> Key: SPARK-19556
> URL: https://issues.apache.org/jira/browse/SPARK-19556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
> write and read data:
> {code}
>   if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(s"Failed to store $pieceId of $broadcastId 
> in local BlockManager")
>   }
> {code}
> {code}
>   bm.getLocalBytes(pieceId) match {
> case Some(block) =>
>   blocks(pid) = block
>   releaseLock(pieceId)
> case None =>
>   bm.getRemoteBytes(pieceId) match {
> case Some(b) =>
>   if (checksumEnabled) {
> val sum = calcChecksum(b.chunks(0))
> if (sum != checksums(pid)) {
>   throw new SparkException(s"corrupt remote block $pieceId of 
> $broadcastId:" +
> s" $sum != ${checksums(pid)}")
> }
>   }
>   // We found the block from remote executors/driver's 
> BlockManager, so put the block
>   // in this executor's BlockManager.
>   if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(
>   s"Failed to store $pieceId of $broadcastId in local 
> BlockManager")
>   }
>   blocks(pid) = b
> case None =>
>   throw new SparkException(s"Failed to get $pieceId of 
> $broadcastId")
>   }
>   }
> {code}
> The thing these block manager methods have in common is that they bypass the 
> encryption code; so broadcast data is stored unencrypted in the block 
> manager, causing unencrypted data to be written to disk if those blocks need 
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to 
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager 
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
> BlockManager to still be able to use file channels to avoid reading whole 
> blocks back into memory so they can be decrypted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

2017-02-14 Thread Genmao Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867152#comment-15867152
 ] 

Genmao Yu commented on SPARK-19556:
---

[~vanzin] I am working on this, could you please assign it to me?


> Broadcast data is not encrypted when I/O encryption is on
> -
>
> Key: SPARK-19556
> URL: https://issues.apache.org/jira/browse/SPARK-19556
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: Marcelo Vanzin
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to 
> write and read data:
> {code}
>   if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(s"Failed to store $pieceId of $broadcastId 
> in local BlockManager")
>   }
> {code}
> {code}
>   bm.getLocalBytes(pieceId) match {
> case Some(block) =>
>   blocks(pid) = block
>   releaseLock(pieceId)
> case None =>
>   bm.getRemoteBytes(pieceId) match {
> case Some(b) =>
>   if (checksumEnabled) {
> val sum = calcChecksum(b.chunks(0))
> if (sum != checksums(pid)) {
>   throw new SparkException(s"corrupt remote block $pieceId of 
> $broadcastId:" +
> s" $sum != ${checksums(pid)}")
> }
>   }
>   // We found the block from remote executors/driver's 
> BlockManager, so put the block
>   // in this executor's BlockManager.
>   if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, 
> tellMaster = true)) {
> throw new SparkException(
>   s"Failed to store $pieceId of $broadcastId in local 
> BlockManager")
>   }
>   blocks(pid) = b
> case None =>
>   throw new SparkException(s"Failed to get $pieceId of 
> $broadcastId")
>   }
>   }
> {code}
> The thing these block manager methods have in common is that they bypass the 
> encryption code; so broadcast data is stored unencrypted in the block 
> manager, causing unencrypted data to be written to disk if those blocks need 
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to 
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager 
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the 
> BlockManager to still be able to use file channels to avoid reading whole 
> blocks back into memory so they can be decrypted.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop

2017-02-14 Thread xukun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867124#comment-15867124
 ] 

xukun commented on SPARK-18113:
---

[~aash]

According to my scenario and the code in [https://github.com/palantir/spark/pull/94]:

For task 678.0:
outputCommitCoordinator.canCommit will match 
CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) => 
CommitState(attemptNumber, System.nanoTime(), MidCommit)

outputCommitCoordinator.commitDone matches CommitState(existingCommitter, 
startTime, MidCommit) if attemptNumber == existingCommitter => 
CommitState(attemptNumber, startTime, Committed)

For task 678.1:
outputCommitCoordinator.canCommit matches CommitState(existingCommitter, _, 
Committed), 

so the driver enters a task dead loop.

> Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
> ---
>
> Key: SPARK-18113
> URL: https://issues.apache.org/jira/browse/SPARK-18113
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.1
> Environment: # cat /etc/redhat-release 
> Red Hat Enterprise Linux Server release 7.2 (Maipo)
>Reporter: xuqing
>Assignee: jin xing
> Fix For: 2.2.0
>
>
> The executor's *AskPermissionToCommitOutput* message to the driver fails, and 
> the executor retries sending it. The driver receives 2 AskPermissionToCommitOutput 
> messages and handles them, but the executor ignores the first response (true) 
> and receives only the second response (false). The TaskAttemptNumber for this 
> partition in authorizedCommittersByStage is locked forever, and the driver 
> enters an infinite loop.
> h4. Driver Log:
> {noformat}
> 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID 
> 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 0
> ...
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 0
> ...
> 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID 
> 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, 
> cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, 
> partition: 24, attemptNumber: 1
> 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, 
> stage: 2, partition: 24, attempt: 1
> ...
> 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 
> (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes)
> ...
> {noformat}
> h4. Executor Log:
> {noformat}
> ...
> 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110)
> ...
> 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = 
> AskPermissionToCommitOutput(2,24,0)] in 1 attempts
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 
> seconds]. This timeout is controlled by spark.rpc.askTimeout
> at 
> org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
> at 
> org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95)
> at 
> org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73)
> at 
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212)
> at 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at 

[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Mingjie Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867064#comment-15867064
 ] 

Mingjie Tang commented on SPARK-18392:
--

Sure, AND-amplification is important and fundamental for the current LSH. We can 
split the effort between AND-amplification and the new hash functions. /cc [~yunn] 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867049#comment-15867049
 ] 

Seth Hendrickson commented on SPARK-18392:
--

I would pretty strongly prefer to focus on adding AND-amplification before 
adding anything else to LSH. That is more of a missing part of the 
functionality, whereas the other things are enhancements. Curious to hear others' 
thoughts on this. 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

2017-02-14 Thread mingjie tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867041#comment-15867041
 ] 

mingjie tang commented on SPARK-18392:
--

[~yunn], are you working on the BitSampling & SignRandomProjection functions? If 
not, I can work on them this week. 

> LSH API, algorithm, and documentation follow-ups
> 
>
> Key: SPARK-18392
> URL: https://issues.apache.org/jira/browse/SPARK-18392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR 
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for 
> hash distance [https://github.com/apache/spark/pull/15800].  This will be 
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in 
> the future
> ** Need to clarify which we support, including in each model function 
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and 
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used 
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This 
> is the "raw" output of all hash functions, i.e., with no aggregation for 
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and 
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality 
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with 
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with 
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the 
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our 
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}}, 
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the 
> original MPLSH paper.  We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method 
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia | 
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a 
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe 
> LSH paper
> Wikipedia
> * 
> [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19603) Fix StreamingQuery explain command

2017-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867037#comment-15867037
 ] 

Apache Spark commented on SPARK-19603:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16934

> Fix StreamingQuery explain command
> --
>
> Key: SPARK-19603
> URL: https://issues.apache.org/jira/browse/SPARK-19603
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now StreamingQuery.explain doesn't show the correct streaming physical 
> plan.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19603) Fix StreamingQuery explain command

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19603:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Fix StreamingQuery explain command
> --
>
> Key: SPARK-19603
> URL: https://issues.apache.org/jira/browse/SPARK-19603
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> Right now StreamingQuery.explain doesn't show the correct streaming physical 
> plan.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19603) Fix StreamingQuery explain command

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19603:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Fix StreamingQuery explain command
> --
>
> Key: SPARK-19603
> URL: https://issues.apache.org/jira/browse/SPARK-19603
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now StreamingQuery.explain doesn't show the correct streaming physical 
> plan.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19603) Fix StreamingQuery explain command

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19603:
-
Summary: Fix StreamingQuery explain command  (was: Fix the stream explain 
command)

> Fix StreamingQuery explain command
> --
>
> Key: SPARK-19603
> URL: https://issues.apache.org/jira/browse/SPARK-19603
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now StreamingQuery.explain doesn't show the correct streaming physical 
> plan.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19603) Fix the stream explain command

2017-02-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19603:


 Summary: Fix the stream explain command
 Key: SPARK-19603
 URL: https://issues.apache.org/jira/browse/SPARK-19603
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.1.0, 2.0.2
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


Right now StreamingQuery.explain doesn't show the correct streaming physical 
plan.
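
A trivial sketch of the call site in question, assuming a local SparkSession; 
the rate source and console sink are used only for brevity:

{code}
import org.apache.spark.sql.SparkSession

object ExplainRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("explain-repro").getOrCreate()

    val query = spark.readStream.format("rate").load()
      .writeStream.format("console").start()

    // Per this issue, the plan printed here is not the correct streaming
    // physical plan.
    query.explain()

    query.stop()
    spark.stop()
  }
}
{code}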



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db name>.<table name>.<column name>)

2017-02-14 Thread Sunitha Kambhampati (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867024#comment-15867024
 ] 

Sunitha Kambhampati commented on SPARK-19602:
-

Attaching the design doc and the proposed solution.  I'd appreciate any 
feedback/suggestions. Also, I will look into submitting a PR soon. 


> Unable to query using the fully qualified column name of the form 
> (<db name>.<table name>.<column name>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, 
> db2.t1
> Most of the other database systems support this (e.g. DB2, Oracle, MySQL).
> Note: In DB2 and Oracle, the notion is <schema name>.<table name>.<column name>.
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in this example 
> only db2.t1 has an i1 column. The query would then silently succeed instead of 
> throwing an error.  
> One way to avoid the confusion would be to explicitly specify the fully 
> qualified name db1.t1.i1, 
> e.g.:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form (<db name>.<table name>.<column name>)

2017-02-14 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Attachment: Design_ColResolution_JIRA19602.docx

> Unable to query using the fully qualified column name of the form 
> (<db name>.<table name>.<column name>)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
> Attachments: Design_ColResolution_JIRA19602.docx
>
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, 
> db2.t1
> Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
> Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in this example 
> only db2.t1 has an i1 column. The query would still succeed instead of 
> throwing an error.
> One way to avoid confusion would be to explicitly specify using the fully 
> qualified name db1.t1.i1 
> For e.g:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)

2017-02-14 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Description: 
1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1

Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
Note: In DB2, Oracle, the notion is of ..

2) Another scenario where this fully qualified name is useful is as follows:

  // current database is db1. 
  select t1.i1 from t1, db2.t1   

If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
error during column resolution in the analyzer, as it is ambiguous. 
Let's say the user intended to retrieve i1 from db1.t1, but in this example only 
db2.t1 has an i1 column. The query would still succeed instead of throwing an 
error.
One way to avoid confusion would be to explicitly specify using the fully 
qualified name db1.t1.i1 
For e.g:  select db1.t1.i1 from t1, db2.t1  

Workarounds:
There is a workaround for these issues, which is to use an alias. 


  was:
1) Spark SQL fails to analyze this query:  
{quote}
select db1.t1.i1 from db1.t1, db2.t1
{quote}
- Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
- Note: In DB2, Oracle, the notion is of ..

2) Another scenario where this fully qualified name is useful is as follows:
{quote}
// current database is db1. 
select t1.i1 from t1, db2.t1   
{quote}
- If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
error during column resolution in the analyzer, as it is ambiguous. 
- Lets say the user intended to retrieve i1 from db1.t1 but in the example 
db2.t1 only has i1 column. The query would still succeed instead of throwing an 
error.  
- One way to avoid confusion would be to explicitly specify using the fully 
qualified name db1.t1.i1 
For e.g:  select db1.t1.i1 from t1, db2.t1  

Workarounds:
There is a workaround for these issues, which is to use an alias. 



> Unable to query using the fully qualified column name of the form ( 
> ..)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
>
> 1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, 
> db2.t1
> Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
> Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where this fully qualified name is useful is as follows:
>   // current database is db1. 
>   select t1.i1 from t1, db2.t1   
> If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
> error during column resolution in the analyzer, as it is ambiguous. 
> Let's say the user intended to retrieve i1 from db1.t1, but in this example 
> only db2.t1 has an i1 column. The query would still succeed instead of 
> throwing an error.
> One way to avoid confusion would be to explicitly specify using the fully 
> qualified name db1.t1.i1 
> For e.g:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)

2017-02-14 Thread Sunitha Kambhampati (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sunitha Kambhampati updated SPARK-19602:

Description: 
1) Spark SQL fails to analyze this query:  
{quote}
select db1.t1.i1 from db1.t1, db2.t1
{quote}
- Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
- Note: In DB2, Oracle, the notion is of ..

2) Another scenario where this fully qualified name is useful is as follows:
{quote}
// current database is db1. 
select t1.i1 from t1, db2.t1   
{quote}
- If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
error during column resolution in the analyzer, as it is ambiguous. 
- Let's say the user intended to retrieve i1 from db1.t1, but in this example 
only db2.t1 has an i1 column. The query would still succeed instead of throwing 
an error.
- One way to avoid confusion would be to explicitly specify using the fully 
qualified name db1.t1.i1 
For e.g:  select db1.t1.i1 from t1, db2.t1  

Workarounds:
There is a workaround for these issues, which is to use an alias. 


  was:
1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
Note: In DB2, Oracle, the notion is of ..

2) Another scenario where this fully qualified name is useful is as follows:

// current database is db1. 
select t1.i1 from t1, db2.t1   

If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
error during column resolution in the analyzer, as it is ambiguous. 

Lets say the user intended to retrieve i1 from db1.t1 but in the example db2.t1 
only has i1 column. The query would still succeed instead of throwing an error. 
 
One way to avoid confusion would be to explicitly specify using the fully 
qualified name db1.t1.i1 
For e.g:  select db1.t1.i1 from t1, db2.t1  

Workarounds:
There is a workaround for these issues, which is to use an alias. 



> Unable to query using the fully qualified column name of the form ( 
> ..)
> --
>
> Key: SPARK-19602
> URL: https://issues.apache.org/jira/browse/SPARK-19602
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Sunitha Kambhampati
>
> 1) Spark SQL fails to analyze this query:  
> {quote}
> select db1.t1.i1 from db1.t1, db2.t1
> {quote}
> - Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
> - Note: In DB2, Oracle, the notion is of ..
> 2) Another scenario where this fully qualified name is useful is as follows:
> {quote}
> // current database is db1. 
> select t1.i1 from t1, db2.t1   
> {quote}
> - If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw 
> an error during column resolution in the analyzer, as it is ambiguous. 
> - Let's say the user intended to retrieve i1 from db1.t1, but in this example 
> only db2.t1 has an i1 column. The query would still succeed instead of 
> throwing an error.
> - One way to avoid confusion would be to explicitly specify using the fully 
> qualified name db1.t1.i1 
> For e.g:  select db1.t1.i1 from t1, db2.t1  
> Workarounds:
> There is a workaround for these issues, which is to use an alias. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)

2017-02-14 Thread Sunitha Kambhampati (JIRA)
Sunitha Kambhampati created SPARK-19602:
---

 Summary: Unable to query using the fully qualified column name of 
the form ( ..)
 Key: SPARK-19602
 URL: https://issues.apache.org/jira/browse/SPARK-19602
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Sunitha Kambhampati


1) Spark SQL fails to analyze this query:  select db1.t1.i1 from db1.t1, db2.t1
Most of the other database systems support this ( e.g DB2, Oracle, MySQL).
Note: In DB2, Oracle, the notion is of ..

2) Another scenario where this fully qualified name is useful is as follows:

// current database is db1. 
select t1.i1 from t1, db2.t1   

If the i1 column exists in both tables: db1.t1 and db2.t1, this will throw an 
error during column resolution in the analyzer, as it is ambiguous. 

Let's say the user intended to retrieve i1 from db1.t1, but in this example only 
db2.t1 has an i1 column. The query would still succeed instead of throwing an 
error.
One way to avoid confusion would be to explicitly specify using the fully 
qualified name db1.t1.i1 
For e.g:  select db1.t1.i1 from t1, db2.t1  

Workarounds:
There is a workaround for these issues, which is to use an alias. 
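
To make the workaround concrete, a minimal sketch (db1, db2, t1 and i1 are the hypothetical names used in this description) of disambiguating with a table alias:

{code}
// Alias workaround: give each table a distinct alias and qualify the column with
// the alias instead of the (unsupported) db.table.column form.
spark.sql("USE db1")
val df = spark.sql("SELECT a.i1 FROM t1 a, db2.t1 b")  // 'a' unambiguously refers to db1.t1
df.show()
{code}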




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14894) Python GaussianMixture summary

2017-02-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-14894.
--
Resolution: Duplicate

[~wangmiao1981], I guess we can also resolve the JIRA ourselves if we are quite 
sure, per http://spark.apache.org/contributing.html:

{quote}
Most contributors are able to directly resolve JIRAs. Use judgment in 
determining whether you are quite confident the issue should be resolved, 
although changes can be easily undone.
{quote}

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab

2017-02-14 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867000#comment-15867000
 ] 

Ruslan Dautkhanov commented on SPARK-19588:
---

Got it. Thanks [~vanzin]

> Allow putting keytab file to HDFS location specified in spark.yarn.keytab
> -
>
> Key: SPARK-19588
> URL: https://issues.apache.org/jira/browse/SPARK-19588
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.0.2, 2.1.0
> Environment: kerberized cluster, Spark 2
>Reporter: Ruslan Dautkhanov
>  Labels: authentication, kerberos, security, yarn-client
>
> As a workaround for SPARK-19038 tried putting keytab in user's home directory 
> in HDFS but this fails with 
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> hdfs:///user/svc_odiprd/.kt does not exist
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}
> This is yarn-client mode, so the driver probably can't see HDFS while 
> submitting a job, although I suspect this is not limited to yarn-client.
> Would be great to support reading keytab for kerberos ticket renewals 
> directly from HDFS.
> We think that in some scenarios it's more secure than referencing a keytab 
> from a local fs on a client machine that does a spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14894) Python GaussianMixture summary

2017-02-14 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866999#comment-15866999
 ] 

Miao Wang commented on SPARK-14894:
---

This is a duplicate of SPARK-18282 and should be closed.

> Python GaussianMixture summary
> --
>
> Key: SPARK-14894
> URL: https://issues.apache.org/jira/browse/SPARK-14894
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.ml, GaussianMixture includes a result summary.  The Python API 
> should provide the same functionality.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19528) external shuffle service would close while still have request from executor when dynamic allocation is enabled

2017-02-14 Thread satheessh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866962#comment-15866962
 ] 

satheessh commented on SPARK-19528:
---

I am also getting the same error from the container: "ERROR 
client.TransportResponseHandler: Still have 1 requests outstanding when 
connection from"

Node manager Logs:


at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, 
chunkIndex=79}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/23/shuffle_1_46802_0.data,
 offset=59855197, length=32076}} to /172.20.96.35:39880; closing connection
java.nio.channels.ClosedChannelException
at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, 
chunkIndex=80}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/23/shuffle_1_48408_0.data,
 offset=61388120, length=38889}} to /172.20.96.35:39880; closing connection
java.nio.channels.ClosedChannelException
at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, 
chunkIndex=81}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/3d/shuffle_1_48518_0.data,
 offset=61130291, length=35265}} to /172.20.96.35:39880; closing connection
java.nio.channels.ClosedChannelException
at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, 
chunkIndex=82}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/2d/shuffle_1_48640_0.data,
 offset=77145914, length=61274}} to /172.20.96.35:39880; closing connection
java.nio.channels.ClosedChannelException
at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, 
chunkIndex=83}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/15/shuffle_1_49038_0.data,
 offset=62340136, length=39151}} to /172.20.96.35:39880; closing connection
java.nio.channels.ClosedChannelException
at 
org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown
 Source)
2017-02-14 19:27:36,300 ERROR 
org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-17): 
Error sending result 
ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027030512, 
chunkIndex=3}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-37dc54f8-1f23-4fc4-bf5e-d0b9a1837867/29/shuffle_1_95_0.data,
 offset=1442114, length=37869}} to /172.20.101.23:37866; closing connection
java.io.IOException: Broken pipe
at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
at 
sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
at 
sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
at 
org.spark_project.io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139)
at 
org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121)
at 
org.spark_project.io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287)
at 
org.spark_project.io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
at 
org.spark_project.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314)
at 

[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab

2017-02-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866932#comment-15866932
 ] 

Marcelo Vanzin commented on SPARK-19588:


bq. driver/yarn#client holds keytab just to distribute it to executors, isn't 
it? 

Not currently, it actually tries to login with the keytab before submitting the 
application. I'm not a fan of that behavior (it confuses a lot of people who 
try to use it as a generic kerberos login for Spark, and run into issues 
because there seem to be a few outstanding issues with the code), but it's a 
little questionable whether it can be changed now (since some people might be 
relying on it).
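
For readers following along, a minimal sketch (the principal and keytab path are placeholders) of the configuration as it works today, i.e. with the keytab on the local filesystem of the submitting machine rather than on HDFS:

{code}
import org.apache.spark.SparkConf

// A local keytab path is what spark-submit currently expects; an hdfs:// URI fails
// with the "Keytab file ... does not exist" error quoted in the description below.
val conf = new SparkConf()
  .set("spark.yarn.principal", "svc_odiprd@EXAMPLE.COM")                // placeholder principal
  .set("spark.yarn.keytab", "/etc/security/keytabs/svc_odiprd.keytab")  // local fs path
{code}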

> Allow putting keytab file to HDFS location specified in spark.yarn.keytab
> -
>
> Key: SPARK-19588
> URL: https://issues.apache.org/jira/browse/SPARK-19588
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.0.2, 2.1.0
> Environment: kerberized cluster, Spark 2
>Reporter: Ruslan Dautkhanov
>  Labels: authentication, kerberos, security, yarn-client
>
> As a workaround for SPARK-19038 tried putting keytab in user's home directory 
> in HDFS but this fails with 
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> hdfs:///user/svc_odiprd/.kt does not exist
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}
> This is yarn-client mode, so the driver probably can't see HDFS while 
> submitting a job, although I suspect this is not limited to yarn-client.
> Would be great to support reading keytab for kerberos ticket renewals 
> directly from HDFS.
> We think that in some scenarios it's more secure than referencing a keytab 
> from a local fs on a client machine that does a spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-02-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19318.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16891
[https://github.com/apache/spark/pull/16891]

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-19318
> URL: https://issues.apache.org/jira/browse/SPARK-19318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
> Fix For: 2.2.0
>
>
> = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
> data types to be mapped to Oracle' =
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals("class java.sql.Date") was false 
> (OracleIntegrationSuite.scala:136)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`

2017-02-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19318:
---

Assignee: Suresh Thalamati

> Docker test case failure: `SPARK-16625: General data types to be mapped to 
> Oracle`
> --
>
> Key: SPARK-19318
> URL: https://issues.apache.org/jira/browse/SPARK-19318
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Suresh Thalamati
> Fix For: 2.2.0
>
>
> = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General 
> data types to be mapped to Oracle' =
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>   types.apply(9).equals("class java.sql.Date") was false 
> (OracleIntegrationSuite.scala:136)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab

2017-02-14 Thread Ruslan Dautkhanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866839#comment-15866839
 ] 

Ruslan Dautkhanov commented on SPARK-19588:
---

The driver/yarn#client holds the keytab just to distribute it to the executors, 
doesn't it? If so, then we don't need to "download" it to local disk for the 
driver/yarn#client, as the executors will already have access to the keytab in 
HDFS. Am I missing something?

> Allow putting keytab file to HDFS location specified in spark.yarn.keytab
> -
>
> Key: SPARK-19588
> URL: https://issues.apache.org/jira/browse/SPARK-19588
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Spark Submit
>Affects Versions: 2.0.2, 2.1.0
> Environment: kerberized cluster, Spark 2
>Reporter: Ruslan Dautkhanov
>  Labels: authentication, kerberos, security, yarn-client
>
> As a workaround for SPARK-19038 tried putting keytab in user's home directory 
> in HDFS but this fails with 
> {noformat}
> Exception in thread "main" org.apache.spark.SparkException: Keytab file: 
> hdfs:///user/svc_odiprd/.kt does not exist
> at 
> org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {noformat}
> This is yarn-client mode, so the driver probably can't see HDFS while 
> submitting a job, although I suspect this is not limited to yarn-client.
> Would be great to support reading keytab for kerberos ticket renewals 
> directly from HDFS.
> We think that in some scenarios it's more secure than referencing a keytab 
> from a local fs on a client machine that does a spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"

2017-02-14 Thread Armin Braun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Armin Braun resolved SPARK-19275.
-
Resolution: Not A Problem

> Spark Streaming, Kafka receiver, "Failed to get records for ... after polling 
> for 512"
> --
>
> Key: SPARK-19275
> URL: https://issues.apache.org/jira/browse/SPARK-19275
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.0
> Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11
>Reporter: Dmitry Ochnev
>
> We have a Spark Streaming application reading records from Kafka 0.10.
> Some tasks fail because of the following error:
> "java.lang.AssertionError: assertion failed: Failed to get records for (...) 
> after polling for 512"
> The first attempt fails and the second attempt (retry) completes successfully; 
> this is the pattern we see for many tasks in our logs. These failures and 
> retries consume resources.
> A similar case with a stack trace is described here:
> https://www.mail-archive.com/user@spark.apache.org/msg56564.html
> https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767
> Here is the line from the stack trace where the error is raised:
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74)
> We tried several values for "spark.streaming.kafka.consumer.poll.ms" - 2, 5, 
> 10, 30 and 60 seconds - but the error appeared in all cases except the last 
> one. Moreover, increasing the threshold also increased the total Spark stage 
> duration. In other words, increasing "spark.streaming.kafka.consumer.poll.ms" 
> led to fewer task failures, but at the cost of longer stages, so it is bad for 
> performance when processing data streams.
> We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other 
> related classes) which inhibits the reading process.
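
For reference, a minimal sketch (app name and batch interval are arbitrary) of raising the poll timeout experimented with above via the streaming configuration:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Raise the consumer poll timeout from the 512 ms shown in the assertion error.
val conf = new SparkConf()
  .setAppName("kafka-poll-timeout-sketch")
  .set("spark.streaming.kafka.consumer.poll.ms", "60000") // 60 seconds
val ssc = new StreamingContext(conf, Seconds(10))
{code}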



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16475) Broadcast Hint for SQL Queries

2017-02-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-16475:
---

Assignee: Reynold Xin

> Broadcast Hint for SQL Queries
> --
>
> Key: SPARK-16475
> URL: https://issues.apache.org/jira/browse/SPARK-16475
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: BroadcastHintinSparkSQL.pdf
>
>
> Broadcast hint is a way for users to manually annotate a query and suggest to 
> the query optimizer the join method. It is very useful when the query 
> optimizer cannot make optimal decision with respect to join methods due to 
> conservativeness or the lack of proper statistics.
> The DataFrame API has broadcast hint since Spark 1.5. However, we do not have 
> an equivalent functionality in SQL queries. We propose adding Hive-style 
> broadcast hint to Spark SQL.
> For more information, please see the attached document. One note about the 
> doc: in addition to supporting "MAPJOIN", we should also support 
> "BROADCASTJOIN" and "BROADCAST" in the comment, e.g. the following should be 
> accepted:
> {code}
> SELECT /*+ MAPJOIN(b) */ ...
> SELECT /*+ BROADCASTJOIN(b) */ ...
> SELECT /*+ BROADCAST(b) */ ...
> {code}
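
For comparison, a short sketch of the existing DataFrame-side hint the description refers to (the two tables are illustrative):

{code}
import org.apache.spark.sql.functions.broadcast

// broadcast() has been available on the DataFrame API since 1.5.
val large = spark.range(1000000).toDF("key")
val small = spark.range(100).toDF("key")
val joined = large.join(broadcast(small), "key")  // hints a broadcast hash join
{code}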



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16475) Broadcast Hint for SQL Queries

2017-02-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-16475.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16925
[https://github.com/apache/spark/pull/16925]

> Broadcast Hint for SQL Queries
> --
>
> Key: SPARK-16475
> URL: https://issues.apache.org/jira/browse/SPARK-16475
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>  Labels: releasenotes
> Fix For: 2.2.0
>
> Attachments: BroadcastHintinSparkSQL.pdf
>
>
> Broadcast hint is a way for users to manually annotate a query and suggest to 
> the query optimizer the join method. It is very useful when the query 
> optimizer cannot make optimal decision with respect to join methods due to 
> conservativeness or the lack of proper statistics.
> The DataFrame API has broadcast hint since Spark 1.5. However, we do not have 
> an equivalent functionality in SQL queries. We propose adding Hive-style 
> broadcast hint to Spark SQL.
> For more information, please see the attached document. One note about the 
> doc: in addition to supporting "MAPJOIN", we should also support 
> "BROADCASTJOIN" and "BROADCAST" in the comment, e.g. the following should be 
> accepted:
> {code}
> SELECT /*+ MAPJOIN(b) */ ...
> SELECT /*+ BROADCASTJOIN(b) */ ...
> SELECT /*+ BROADCAST(b) */ ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19593:
-
Priority: Trivial  (was: Critical)

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Trivial
>
> The question is related to the Spark Streaming + Kinesis integration.
> Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
> records read per transaction?
> Right now, even though I am eligible to read 2.8 GB/minute, I am restricted to 
> reading only 100 MB/minute, as I am not able to increase the default number of 
> records in each transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42K/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19593:
-
Component/s: (was: Structured Streaming)
 (was: Spark Core)
 DStreams

> Records read per each kinesis transaction
> -
>
> Key: SPARK-19593
> URL: https://issues.apache.org/jira/browse/SPARK-19593
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.0.1
>Reporter: Sarath Chandra Jiguru
>Priority: Critical
>
> The question is related to the Spark Streaming + Kinesis integration.
> Is there a way to provide a Kinesis consumer configuration, e.g. the number of 
> records read per transaction?
> Right now, even though I am eligible to read 2.8 GB/minute, I am restricted to 
> reading only 100 MB/minute, as I am not able to increase the default number of 
> records in each transaction.
> I have raised a question in stackoverflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42K/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 
> Shards*2 MB*60 Seconds)
> Reference: 
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more then one listeners exists

2017-02-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866778#comment-15866778
 ] 

Shixiong Zhu commented on SPARK-19594:
--

Good catch. Would you like to submit a PR to fix it?

> StreamingQueryListener fails to handle QueryTerminatedEvent if more then one 
> listeners exists
> -
>
> Key: SPARK-19594
> URL: https://issues.apache.org/jira/browse/SPARK-19594
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Eyal Zituny
>Priority: Minor
>
> reproduce:
> *create a spark session
> *add multiple streaming query listeners
> *create a simple query
> *stop the query
> result -> only the first listener handle the QueryTerminatedEvent
> this might happen because the query run id is being removed from 
> activeQueryRunIds once the onQueryTerminated is called 
> (StreamingQueryListenerBus:115)
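
A minimal sketch of the reproduction steps above (listener names and the socket source are illustrative, and a SparkSession named spark is assumed):

{code}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Two listeners that only log terminations; per this issue only the first one fires.
class LoggingListener(name: String) extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"$name handled termination of query ${event.id}")
}

spark.streams.addListener(new LoggingListener("listener-1"))
spark.streams.addListener(new LoggingListener("listener-2"))

val query = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999).load()
  .writeStream.format("console").start()

query.stop()  // expected: both listeners print; observed: only listener-1 does
{code}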



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19387) CRAN tests do not run with SparkR source package

2017-02-14 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-19387.
---
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1

Issue resolved by pull request 16720
[https://github.com/apache/spark/pull/16720]

> CRAN tests do not run with SparkR source package
> 
>
> Key: SPARK-19387
> URL: https://issues.apache.org/jira/browse/SPARK-19387
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Felix Cheung
>Assignee: Felix Cheung
> Fix For: 2.1.1, 2.2.0
>
>
> It looks like sparkR.session() is not installing Spark - as a result, running 
> R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to 
> CRAN.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866768#comment-15866768
 ] 

Timothy Hunter commented on SPARK-19208:


Yes, I meant returning a struct and then projecting this struct. I do not think 
there is any other way right now with the current UDAFs, as you mention. In 
that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, 
the fields of which are decided by the arguments in {{.metrics}}, and each of 
the individual functions {{VectorSummarizer.min/max/variance(...)}} returns 
columns of vectors or matrices.
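
To illustrate the struct-then-project pattern with APIs that exist today (plain scalar aggregates standing in for the proposed, not-yet-existing {{VectorSummarizer}}):

{code}
import org.apache.spark.sql.functions.{col, max, min, struct}

// Build a struct-typed "summary" column, then project its fields back out
// as top-level columns, which is what the rewrite rule would automate.
val df = spark.range(10).toDF("x")
val summary = df
  .agg(min(col("x")).as("min"), max(col("x")).as("max"))
  .select(struct(col("min"), col("max")).as("s"))
val unpacked = summary.select(col("s.min").as("min(x)"), col("s.max").as("max(x)"))
unpacked.show()
{code}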

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755
 ] 

Nick Pentreath commented on SPARK-19208:


Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{{code}}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{{code}}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755
 ] 

Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM:
-

Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{code}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{code}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?


was (Author: mlnick):
Ah right I see - yes rewrite rules would be a good ultimate goal.

One question I have - if we do:

{{code}}
val summary = df.select(VectorSummarizer.metrics("min", 
"max").summary("features"))
{{code}}

How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple 
return cols are not supported by UDAF?

Or do we have to live with the struct return type for now until we could do the 
rewrite version?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-19523:
-
Issue Type: Question  (was: Improvement)

> Spark streaming+ insert into table leaves bunch of trash in table directory
> ---
>
> Key: SPARK-19523
> URL: https://issues.apache.org/jira/browse/SPARK-19523
> Project: Spark
>  Issue Type: Question
>  Components: DStreams, SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> I have very simple code which transforms incoming JSON files into a Parquet table:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.sql.SaveMode
> object Client_log {
>   def main(args: Array[String]): Unit = {
> val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * 
> from temp.x_streaming where year=2015 and month=12 and day=1").dtypes
> var columns = resultCols.filter(x => 
> !Commons.stopColumns.contains(x._1)).map({ case (name, types) => {
>   s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as 
> ${Commons.mapType(types)}) as $name"""
> }
> })
> columns ++= List("'streaming' as sourcefrom")
> def f(path:Path): Boolean = {
>   true
> }
> val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, 
> TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false)
> client_log_d_stream.foreachRDD(rdd => {
>   val localHiveContext = new HiveContext(rdd.sparkContext)
>   import localHiveContext.implicits._
>   var input = rdd.map(x => Record(x._2.toString)).toDF()
>   input = input.selectExpr(columns: _*)
>   input =
> SmallOperators.populate(input, resultCols)
>   input
> .write
> .mode(SaveMode.Append)
> .format("parquet")
> .insertInto("temp.x_streaming")
> })
> Spark.ssc.start()
> Spark.ssc.awaitTermination()
>   }
>   case class Record(s: String)
> }
> {code}
> This code generates a lot of trash directories in the result table directory, like:
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19523.
--
Resolution: Not A Bug

> Spark streaming+ insert into table leaves bunch of trash in table directory
> ---
>
> Key: SPARK-19523
> URL: https://issues.apache.org/jira/browse/SPARK-19523
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams, SQL
>Affects Versions: 2.0.2
>Reporter: Egor Pahomov
>Priority: Minor
>
> I have very simple code which transforms incoming JSON files into a Parquet table:
> {code}
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.sql.SaveMode
> object Client_log {
>   def main(args: Array[String]): Unit = {
> val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * 
> from temp.x_streaming where year=2015 and month=12 and day=1").dtypes
> var columns = resultCols.filter(x => 
> !Commons.stopColumns.contains(x._1)).map({ case (name, types) => {
>   s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as 
> ${Commons.mapType(types)}) as $name"""
> }
> })
> columns ++= List("'streaming' as sourcefrom")
> def f(path:Path): Boolean = {
>   true
> }
> val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, 
> TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false)
> client_log_d_stream.foreachRDD(rdd => {
>   val localHiveContext = new HiveContext(rdd.sparkContext)
>   import localHiveContext.implicits._
>   var input = rdd.map(x => Record(x._2.toString)).toDF()
>   input = input.selectExpr(columns: _*)
>   input =
> SmallOperators.populate(input, resultCols)
>   input
> .write
> .mode(SaveMode.Append)
> .format("parquet")
> .insertInto("temp.x_streaming")
> })
> Spark.ssc.start()
> Spark.ssc.awaitTermination()
>   }
>   case class Record(s: String)
> }
> {code}
> This code generates a lot of trash directories in the result table directory, like:
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1
> drwxrwxrwt   3 egor   nobody 4096 Feb  8 14:15 
> .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-19497:


Assignee: Shixiong Zhu

> dropDuplicates with watermark
> -
>
> Key: SPARK-19497
> URL: https://issues.apache.org/jira/browse/SPARK-19497
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Michael Armbrust
>Assignee: Shixiong Zhu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714
 ] 

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM:
-

Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("vector_summary(features).min").as("min(features)"), 
col("vector_summary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute independently their 
statistics: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?


was (Author: timhunter):
Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), 
col("VectorSummary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute independently their 
statistics: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714
 ] 

Timothy Hunter commented on SPARK-19208:


Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), 
col("VectorSummary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute independently their 
statistics: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19601:


Assignee: Apache Spark  (was: Xiao Li)

> Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
> 
>
> Key: SPARK-19601
> URL: https://issues.apache.org/jira/browse/SPARK-19601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> When users use the shuffle-enabled `repartition` API, they expect the number 
> of partitions they get to be exactly the number they provided, even if they 
> call the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` 
> rule does not consider whether shuffle is enabled or not. Thus, we get the 
> following unexpected result.
> {noformat}
> val df = spark.range(0, 1, 1, 5)
> val df2 = df.repartition(10)
> assert(df2.coalesce(13).rdd.getNumPartitions == 5)
> assert(df2.coalesce(7).rdd.getNumPartitions == 5)
> assert(df2.coalesce(3).rdd.getNumPartitions == 3)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition

2017-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866709#comment-15866709
 ] 

Apache Spark commented on SPARK-19601:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/16933

> Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
> 
>
> Key: SPARK-19601
> URL: https://issues.apache.org/jira/browse/SPARK-19601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When users use the shuffle-enabled `repartition` API, they expect the number 
> of partitions they get to be exactly the number they provided, even if they 
> call the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` 
> rule does not consider whether shuffle is enabled or not. Thus, we get the 
> following unexpected result.
> {noformat}
> val df = spark.range(0, 1, 1, 5)
> val df2 = df.repartition(10)
> assert(df2.coalesce(13).rdd.getNumPartitions == 5)
> assert(df2.coalesce(7).rdd.getNumPartitions == 5)
> assert(df2.coalesce(3).rdd.getNumPartitions == 3)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19601:


Assignee: Xiao Li  (was: Apache Spark)

> Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
> 
>
> Key: SPARK-19601
> URL: https://issues.apache.org/jira/browse/SPARK-19601
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> When users use the shuffle-enabled `repartition` API, they expect the number 
> of partitions they get to be exactly the number they provided, even if they 
> call the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` 
> rule does not consider whether shuffle is enabled or not. Thus, we get the 
> following unexpected result.
> {noformat}
> val df = spark.range(0, 1, 1, 5)
> val df2 = df.repartition(10)
> assert(df2.coalesce(13).rdd.getNumPartitions == 5)
> assert(df2.coalesce(7).rdd.getNumPartitions == 5)
> assert(df2.coalesce(3).rdd.getNumPartitions == 3)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition

2017-02-14 Thread Xiao Li (JIRA)
Xiao Li created SPARK-19601:
---

 Summary: Fix CollapseRepartition rule to preserve shuffle-enabled 
Repartition
 Key: SPARK-19601
 URL: https://issues.apache.org/jira/browse/SPARK-19601
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.2
Reporter: Xiao Li
Assignee: Xiao Li


When users use the shuffle-enabled `repartition` API, they expect the number of 
partitions they get to be exactly the number they provided, even if they call 
the shuffle-disabled `coalesce` later. Currently, the `CollapseRepartition` rule 
does not consider whether shuffle is enabled or not. Thus, we get the following 
unexpected result.

{noformat}
val df = spark.range(0, 1, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
{noformat}
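
Purely as an illustration of the intended behavior, here is a rough sketch of my 
own (assuming the {{Repartition}} logical node carries a {{shuffle}} flag; this is 
not the actual fix):

{code}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

// Only collapse two adjacent Repartition nodes when doing so cannot silently
// drop a user-requested shuffle: a shuffle-disabled coalesce on top of a
// shuffle-enabled repartition is left alone.
object CollapseRepartitionSketch extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transformUp {
    case Repartition(numPartitions, shuffle, Repartition(_, innerShuffle, child))
        if shuffle || !innerShuffle =>
      Repartition(numPartitions, shuffle, child)
  }
}
{code}

With a rule along these lines, {{df2.coalesce(3)}} above still yields 3 partitions, 
while {{df2.coalesce(13)}} keeps the 10 partitions produced by the shuffle-enabled 
{{repartition(10)}}.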




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866675#comment-15866675
 ] 

Nick Pentreath commented on SPARK-19208:


When I said "estimator-like", I didn't mean it should necessarily be an actual 
{{Estimator}} (I agree it is not really intended to fit into transformers & 
pipelines), but rather mimic the API, i.e. that the summarizer is "fitted" on a 
dataset to return a summary.

I just wasn't too keen on the idea of returning a struct, as it feels sort 
of clunky relative to returning a df with vector columns {{"mean", "min", 
"max"}}, etc.

Supporting SS and {{groupBy}} seems like an important goal, so something like 
[~timhunter]'s suggestion looks like it will work nicely.

As for doing it via catalyst rules, first prize would be to automatically 
re-use the intermediate results for multiple end-result computations, and only 
compute what is necessary for the required end results. But is that even 
supported for UDTs currently? I'm not an expert, but my understanding was that 
it is not supported yet.

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS

2017-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866544#comment-15866544
 ] 

Sean Owen commented on SPARK-19600:
---

You indicated it was the same issue, but it's still not the right place to ask 
the question.

> ArrayIndexOutOfBoundsException in ALS
> -
>
> Key: SPARK-19600
> URL: https://issues.apache.org/jira/browse/SPARK-19600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: zhengxiang pan
>Priority: Blocker
>
> I understand issue SPARK-3080 is closed, but I don't yet understand what 
> causes the issue: memory, parallelism, a negative userID or product ID?
> I consistently ran into this issue with different training sets; can you 
> suggest any area to look at?
> java.lang.ArrayIndexOutOfBoundsException: 221529807
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944)
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535
 ] 

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM:
-

I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have an API like 
this:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...)
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", 
...)
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features")
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", 
"weights")
{code}

What do you think? I will be happy to put together a proposal.
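
To make the single-UDAF idea a bit more concrete, here is a minimal sketch of my 
own (none of these names exist in Spark, and this is not the proposal itself): a 
typed {{Aggregator}} whose buffer carries only what the requested metrics need, 
here min and mean over {{Array[Double]}} features:

{code}
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Buffer holds only what min and mean require: per-dimension min and sum,
// plus a row count.
case class MinMeanBuffer(min: Array[Double], sum: Array[Double], count: Long)

class VectorMinMeanAgg(dim: Int)
  extends Aggregator[Array[Double], MinMeanBuffer, (Array[Double], Array[Double])] {

  def zero: MinMeanBuffer =
    MinMeanBuffer(Array.fill(dim)(Double.PositiveInfinity), Array.fill(dim)(0.0), 0L)

  def reduce(b: MinMeanBuffer, v: Array[Double]): MinMeanBuffer = {
    var i = 0
    while (i < dim) {
      b.min(i) = math.min(b.min(i), v(i))
      b.sum(i) += v(i)
      i += 1
    }
    MinMeanBuffer(b.min, b.sum, b.count + 1)
  }

  def merge(a: MinMeanBuffer, b: MinMeanBuffer): MinMeanBuffer = {
    var i = 0
    while (i < dim) {
      a.min(i) = math.min(a.min(i), b.min(i))
      a.sum(i) += b.sum(i)
      i += 1
    }
    MinMeanBuffer(a.min, a.sum, a.count + b.count)
  }

  // Mean is sum / count; guard against an empty buffer.
  def finish(b: MinMeanBuffer): (Array[Double], Array[Double]) =
    (b.min, b.sum.map(_ / math.max(b.count, 1L)))

  def bufferEncoder: Encoder[MinMeanBuffer] = Encoders.product[MinMeanBuffer]
  def outputEncoder: Encoder[(Array[Double], Array[Double])] =
    Encoders.product[(Array[Double], Array[Double])]
}
{code}

It can be applied with {{ds.select(new VectorMinMeanAgg(dim).toColumn)}} on a 
{{Dataset[Array[Double]]}}, and since it is an ordinary aggregator it should also 
compose with {{groupBy}}; the {{summaryStats(...).summary(...)}} wrappers above 
would only decide which buffers to carry.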



was (Author: timhunter):
I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have a the 
following API:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...)
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", 
...)
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features")
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", 
"weights")
{code}

What do you think? I will be happy to put together a proposal.


> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For 

[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535
 ] 

Timothy Hunter commented on SPARK-19208:


I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have the 
following API:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...)
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", 
...)
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features")
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", 
"weights")
{code}

What do you think? I will be happy to put together a proposal.


> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However, {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task; moreover, it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748,401 instances and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-02-14 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866523#comment-15866523
 ] 

Herman van Hovell commented on SPARK-19518:
---

Go for it.

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, 
> however it does not work in SQL statements as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-02-14 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866523#comment-15866523
 ] 

Herman van Hovell edited comment on SPARK-19518 at 2/14/17 8:00 PM:


Go for it!


was (Author: hvanhovell):
Go for it.

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, 
> however it does not work in SQL statements as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19501) Slow checking if there are many spark.yarn.jars, which are already on HDFS

2017-02-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-19501.

   Resolution: Fixed
 Assignee: Jong Wook Kim
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> Slow checking if there are many spark.yarn.jars, which are already on HDFS
> --
>
> Key: SPARK-19501
> URL: https://issues.apache.org/jira/browse/SPARK-19501
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Jong Wook Kim
>Assignee: Jong Wook Kim
>Priority: Minor
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> Hi, this is my first Spark issue submission, so please excuse any 
> inconsistencies.
> I am experiencing a slower application startup time when I specify many files 
> as {{spark.yarn.jars}}, by setting it to all JARs in an HDFS folder, such as 
> {{hdfs://namenode/user/spark/lib/*.jar}}.
> Since the JAR files are already on the same HDFS that YARN is running on, the 
> application should be very fast to start up. However, the delay is significant, 
> especially when {{spark-submit}} is running from a non-local network, because 
> {{spark-yarn}} accesses the individual JAR files via HDFS, adding hundreds of 
> RTTs before the application is ready. The official Spark distribution with 
> Hadoop 2.7 has more than 200 jars, and >100 even if we exclude Hadoop and its 
> dependencies.
> There are currently two HDFS RPC calls for each file: one at 
> {{ClientDistributedCacheManager.addResource}} calling {{fs.getFileStatus}}, 
> and another at {{yarn.Client.copyFileToRemote}} calling {{fc.resolvePath}}. I 
> suppose that both are unnecessary, since we [already retrieved all 
> FileStatuses, and that those are not 
> symlinks|https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531].
> To fix this, I suppose that we can modify {{addResource}} to use its 
> {{statCache}} variable before making an HDFS RPC and populate {{statCache}} 
> appropriately before calling {{addResource}}. Also, an optional boolean 
> parameter of {{copyFileToRemote}} can be added to skip the symlink check.
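
As a rough illustration of the {{statCache}} idea (my own sketch, not the merged 
change), the directory listing that is already performed could seed the cache so 
that the later per-file lookups are answered locally:

{code}
import java.net.URI

import scala.collection.mutable.HashMap

import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

// List the jar directory once and remember each FileStatus keyed by URI, so a
// later lookup (e.g. inside addResource) does not need another getFileStatus
// RPC per jar.
def seedStatCache(fs: FileSystem, jarDir: Path): HashMap[URI, FileStatus] = {
  val statCache = new HashMap[URI, FileStatus]()
  fs.listStatus(jarDir).foreach { status =>
    statCache(status.getPath.toUri) = status
  }
  statCache
}
{code}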



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS

2017-02-14 Thread zhengxiang pan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866516#comment-15866516
 ] 

zhengxiang pan commented on SPARK-19600:


Are you sure it is a duplicate of SPARK-3080, even without looking at it?

> ArrayIndexOutOfBoundsException in ALS
> -
>
> Key: SPARK-19600
> URL: https://issues.apache.org/jira/browse/SPARK-19600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: zhengxiang pan
>Priority: Blocker
>
> I understand issue SPARK-3080 is closed, but I don't yet understand what 
> causes the issue: memory, parallelism, a negative userID or product ID?
> I consistently ran into this issue with different training sets; can you 
> suggest any area to look at?
> java.lang.ArrayIndexOutOfBoundsException: 221529807
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944)
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS

2017-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19600.
---
Resolution: Duplicate

Questions go to the mailing list. Don't open duplicate JIRAs to re-ask.

> ArrayIndexOutOfBoundsException in ALS
> -
>
> Key: SPARK-19600
> URL: https://issues.apache.org/jira/browse/SPARK-19600
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.0.1
>Reporter: zhengxiang pan
>Priority: Blocker
>
> I understand issue SPARK-3080 is closed, but I don't yet understand what 
> causes the issue: memory, parallelism, a negative userID or product ID?
> I consistently ran into this issue with different training sets; can you 
> suggest any area to look at?
> java.lang.ArrayIndexOutOfBoundsException: 221529807
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944)
> at 
> org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements

2017-02-14 Thread Sameer Abhyankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866461#comment-15866461
 ] 

Sameer Abhyankar commented on SPARK-19518:
--

I have come across this issue recently with Spark 2.0 and would like to work on 
this unless it is already assigned to someone else!

> IGNORE NULLS in first_value / last_value should be supported in SQL statements
> --
>
> Key: SPARK-19518
> URL: https://issues.apache.org/jira/browse/SPARK-19518
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Ferenc Erdelyi
>
> https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, 
> however it does not work in SQL statements as it is not implemented in Hive 
> yet: https://issues.apache.org/jira/browse/HIVE-11189



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866445#comment-15866445
 ] 

Apache Spark commented on SPARK-19599:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/16932

> Clean up HDFSMetadataLog for Hadoop 2.6+
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19599:


Assignee: Apache Spark

> Clean up HDFSMetadataLog for Hadoop 2.6+
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19599:


Assignee: (was: Apache Spark)

> Clean up HDFSMetadataLog for Hadoop 2.6+
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-19529.

Resolution: Fixed

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> which will eventually time out). This has bad impacts on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.
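
As a small, self-contained toy illustration of why the interruptible wait matters 
(plain Netty, not Spark code): a thread blocked in {{await()}} is woken up by 
{{Thread.interrupt()}}, which is what {{interruptOnCancel}} relies on, whereas 
{{awaitUninterruptibly()}} keeps it blocked.

{code}
import io.netty.util.concurrent.DefaultEventExecutor

object InterruptibleAwaitDemo {
  def main(args: Array[String]): Unit = {
    val executor = new DefaultEventExecutor()
    val promise = executor.newPromise[String]()   // never completed on purpose

    val waiter = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          promise.await()                         // interruptible wait
        } catch {
          case _: InterruptedException =>
            println("wait interrupted; the task can exit instead of becoming a zombie")
        }
      }
    })

    waiter.start()
    Thread.sleep(200)
    waiter.interrupt()                            // simulates cancelling the task
    waiter.join()
    executor.shutdownGracefully()
  }
}
{code}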



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS

2017-02-14 Thread zhengxiang pan (JIRA)
zhengxiang pan created SPARK-19600:
--

 Summary: ArrayIndexOutOfBoundsException in ALS
 Key: SPARK-19600
 URL: https://issues.apache.org/jira/browse/SPARK-19600
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.1
Reporter: zhengxiang pan
Priority: Blocker


I understand issue SPARK-3080 is closed, but I don't yet understand what causes 
the issue: memory, parallelism, a negative userID or product ID?

I consistently ran into this issue with different training sets; can you 
suggest any area to look at?

java.lang.ArrayIndexOutOfBoundsException: 221529807
at 
org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944)
at 
org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-19529:
---
Fix Version/s: 1.6.4

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> which will eventually time out). This has bad impacts on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-19529:
---
Target Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0  (was: 1.6.3, 2.0.3, 2.1.1, 
2.2.0)

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0
>
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> which will eventually time out). This has bad impacts on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-19599:
---
Affects Version/s: (was: 2.1.0)
   2.2.0

> Clean up HDFSMetadataLog for Hadoop 2.6+
> 
>
> Key: SPARK-19599
> URL: https://issues.apache.org/jira/browse/SPARK-19599
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>
> SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
> cleanup for HDFSMetadataLog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()

2017-02-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-19529:
---
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
> ---
>
> Key: SPARK-19529
> URL: https://issues.apache.org/jira/browse/SPARK-19529
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
> In Spark's Netty RPC layer, TransportClientFactory.createClient() calls 
> awaitUninterruptibly() on a Netty future while waiting for a connection to be 
> established. This creates a problem when a Spark task is interrupted while 
> blocking in this call (which can happen in the event of a slow connection 
> which will eventually time out). This has bad impacts on task cancellation 
> when interruptOnCancel = true.
> As an example of the impact of this problem, I experienced significant 
> numbers of uncancellable "zombie tasks" on a production cluster where several 
> tasks were blocked trying to connect to a dead shuffle server and then 
> continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack:
> {code}
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:460)
> io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) 
> io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
>  
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>  => holding Monitor(java.lang.Object@1849476028}) 
> org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>  
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>  
> org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
> 350) 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
>  
> org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
>  
> org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
>  
> org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
>  
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) 
> [...]
> {code}
> I believe that we can easily fix this by using the 
> InterruptedException-throwing await() instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+

2017-02-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-19599:


 Summary: Clean up HDFSMetadataLog for Hadoop 2.6+
 Key: SPARK-19599
 URL: https://issues.apache.org/jira/browse/SPARK-19599
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Shixiong Zhu


SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some 
cleanup for HDFSMetadataLog



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19587) Disallow when sort columns are part of partitioning columns

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19587:


Assignee: (was: Apache Spark)

> Disallow when sort columns are part of partitioning columns
> ---
>
> Key: SPARK-19587
> URL: https://issues.apache.org/jira/browse/SPARK-19587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> This came up in discussion at 
> https://github.com/apache/spark/pull/16898#discussion_r100697138
> Allowing partition columns to be a part of sort columns should not be 
> supported (logically it does not make sense). 
> {code}
> df.write
>   .format(source)
>   .partitionBy("i")
>   .bucketBy(8, "x")
>   .sortBy("i")
>   .saveAsTable("bucketed_table")
> {code}
> Hive fails for such a case.
> {code}
> CREATE TABLE user_info_bucketed(user_id BIGINT) 
> PARTITIONED BY(ds STRING)
> CLUSTERED BY(user_id)
> SORTED BY (ds ASC)
> INTO 8 BUCKETS;
> 
> FAILED: SemanticException [Error 10002]: Invalid column reference
> Caused by: SemanticException: Invalid column reference
> {code}
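
For illustration, a minimal sketch (a hypothetical helper, not the actual change) 
of the kind of validation this implies on the write path:

{code}
// Reject bucket sort columns that are also partition columns, mirroring the
// Hive behaviour quoted above. Names here are made up for the example.
def assertSortColumnsNotPartitioned(
    partitionCols: Seq[String],
    sortCols: Seq[String]): Unit = {
  val overlap = sortCols.toSet.intersect(partitionCols.toSet)
  require(overlap.isEmpty,
    s"sortBy columns ${overlap.mkString(", ")} should not be part of partitionBy columns")
}
{code}

For the quoted example, {{assertSortColumnsNotPartitioned(Seq("i"), Seq("i"))}} 
would fail fast instead of producing an unsupported table.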



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19587) Disallow when sort columns are part of partitioning columns

2017-02-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866437#comment-15866437
 ] 

Apache Spark commented on SPARK-19587:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/16931

> Disallow when sort columns are part of partitioning columns
> ---
>
> Key: SPARK-19587
> URL: https://issues.apache.org/jira/browse/SPARK-19587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>
> This came up in discussion at 
> https://github.com/apache/spark/pull/16898#discussion_r100697138
> Allowing partition columns to be a part of sort columns should not be 
> supported (logically it does not make sense). 
> {code}
> df.write
>   .format(source)
>   .partitionBy("i")
>   .bucketBy(8, "x")
>   .sortBy("i")
>   .saveAsTable("bucketed_table")
> {code}
> Hive fails for such a case.
> {code}
> CREATE TABLE user_info_bucketed(user_id BIGINT) 
> PARTITIONED BY(ds STRING)
> CLUSTERED BY(user_id)
> SORTED BY (ds ASC)
> INTO 8 BUCKETS;
> 
> FAILED: SemanticException [Error 10002]: Invalid column reference
> Caused by: SemanticException: Invalid column reference
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19587) Disallow when sort columns are part of partitioning columns

2017-02-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19587:


Assignee: Apache Spark

> Disallow when sort columns are part of partitioning columns
> ---
>
> Key: SPARK-19587
> URL: https://issues.apache.org/jira/browse/SPARK-19587
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Tejas Patil
>Assignee: Apache Spark
>
> This came up in discussion at 
> https://github.com/apache/spark/pull/16898#discussion_r100697138
> Allowing partition columns to be a part of sort columns should not be 
> supported (logically it does not make sense). 
> {code}
> df.write
>   .format(source)
>   .partitionBy("i")
>   .bucketBy(8, "x")
>   .sortBy("i")
>   .saveAsTable("bucketed_table")
> {code}
> Hive fails for such a case.
> {code}
> CREATE TABLE user_info_bucketed(user_id BIGINT) 
> PARTITIONED BY(ds STRING)
> CLUSTERED BY(user_id)
> SORTED BY (ds ASC)
> INTO 8 BUCKETS;
> 
> FAILED: SemanticException [Error 10002]: Invalid column reference
> Caused by: SemanticException: Invalid column reference
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed

2017-02-14 Thread Armin Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866428#comment-15866428
 ] 

Armin Braun commented on SPARK-19592:
-

IMO this also relates to the ability to handle 
https://issues.apache.org/jira/browse/SPARK-8985 in a clean way, btw.

> Duplication in Test Configuration Relating to SparkConf Settings Should be 
> Removed
> --
>
> Key: SPARK-19592
> URL: https://issues.apache.org/jira/browse/SPARK-19592
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0, 2.2.0
> Environment: Applies to all Environments
>Reporter: Armin Braun
>Priority: Minor
>
> This configuration for Surefire, Scalatest is duplicated in the parent POM as 
> well as the SBT build.
> While this duplication cannot be removed in general it can at least be 
> removed for all system properties that simply result in a SparkConf setting I 
> think.
> Instead of having lines like 
> {code}
> <spark.ui.enabled>false</spark.ui.enabled>
> {code}
> twice in the pom.xml
> and once in SBT as
> {code}
> javaOptions in Test += "-Dspark.ui.enabled=false",
> {code}
> it would be a lot cleaner to simply have a 
> {code}
> var conf: SparkConf 
> {code}
> field in 
> {code}
> org.apache.spark.SparkFunSuite
> {code}
>  that has SparkConf set up with all the shared configuration that 
> `systemProperties` currently provide. Obviously this cannot be done straight 
> away given that
> many subclasses of the parent suite do this, so I think it would be best to 
> simply add a method to the parent that provides this configuration for now
> and start refactoring away duplication in other suite setups from there step 
> by step until the sys properties can be removed from the pom.xml and build.sbt.
> This makes the build a lot easier to maintain and makes tests more readable 
> by making the environment setup more explicit in the code.
> (also it would allow running more tests straight from the IDE which is always 
> a nice thing imo)
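
As a sketch of the direction described above (purely illustrative; the trait name, method name and the exact property set are assumptions, not an agreed design):

{code}
// Hypothetical sketch: keep the test-only defaults in one place in code instead of
// duplicating them as system properties in pom.xml and the SBT build.
import org.apache.spark.SparkConf

trait SharedTestConf {
  // Suites would start from this conf rather than calling `new SparkConf()` directly.
  protected def defaultTestConf: SparkConf = new SparkConf()
    .set("spark.ui.enabled", "false")
    .set("spark.port.maxRetries", "100")
    .set("spark.unsafe.exceptionOnMemoryLeak", "true")
}
{code}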



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866410#comment-15866410
 ] 

Reynold Xin commented on SPARK-19598:
-

cc [~windpiger] are you interested in working on this? 

> Remove the alias parameter in UnresolvedRelation
> 
>
> Key: SPARK-19598
> URL: https://issues.apache.org/jira/browse/SPARK-19598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the 
> relation an alias. I think we can actually remove it and replace its use with 
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on 
> SubqueryAlias. For example, the broadcast hint pull request can have one 
> fewer case https://github.com/apache/spark/pull/16925/files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

2017-02-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-19598:
---

 Summary: Remove the alias parameter in UnresolvedRelation
 Key: SPARK-19598
 URL: https://issues.apache.org/jira/browse/SPARK-19598
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


UnresolvedRelation has a second argument named "alias", for assigning the 
relation an alias. I think we can actually remove it and replace its use with a 
SubqueryAlias.

This would actually simplify some analyzer code to only match on SubqueryAlias. 
For example, the broadcast hint pull request can have one fewer case 
https://github.com/apache/spark/pull/16925/files.
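
A toy before/after sketch of the idea, using stand-in case classes rather than Catalyst's real ones (their exact constructor signatures vary across versions, so this only shows the shape of the change):

{code}
// Stand-in classes; they mirror the shape of the Catalyst nodes, not their real APIs.
sealed trait LogicalPlan
case class UnresolvedRelation(table: String) extends LogicalPlan            // no alias parameter
case class SubqueryAlias(alias: String, child: LogicalPlan) extends LogicalPlan

// An aliased table reference is expressed by wrapping, not by a constructor argument:
def tableRef(table: String, alias: Option[String]): LogicalPlan = {
  val rel = UnresolvedRelation(table)
  alias.fold(rel: LogicalPlan)(a => SubqueryAlias(a, rel))
}
{code}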



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5

2017-02-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung resolved SPARK-19571.
--
Resolution: Fixed
  Assignee: Hyukjin Kwon

> tests are failing to run on Windows with another instance Derby error with 
> Hadoop 2.6.5
> ---
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>Assignee: Hyukjin Kwon
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something has changed (not likely caused by R) such that tests running on 
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run appveyor only when there are R changes, it is a bit harder to 
> track down which change specifically caused this.
> We also can't run appveyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5

2017-02-14 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-19571:
-
Summary: tests are failing to run on Windows with another instance Derby 
error with Hadoop 2.6.5  (was: appveyor windows tests are failing)

> tests are failing to run on Windows with another instance Derby error with 
> Hadoop 2.6.5
> ---
>
> Key: SPARK-19571
> URL: https://issues.apache.org/jira/browse/SPARK-19571
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.2.0
>Reporter: Felix Cheung
>
> Between 
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master
> https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db
> And
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master
> https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a
> Something has changed (not likely caused by R) such that tests running on 
> Windows are consistently failing with
> {code}
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database 
> C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore.
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at org.apache.derby.iapi.error.StandardException.newException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown
>  Source)
>   at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown 
> Source)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown
>  Source)
>   at 
> org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
>   at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown 
> Source)
>   at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown 
> Source)
> {code}
> Since we run appveyor only when there are R changes, it is a bit harder to 
> track down which change specifically caused this.
> We also can't run appveyor on branch-2.1, so it could also be broken there.
> This could be a blocker, since it could fail tests for the R release.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19163) Lazy creation of the _judf

2017-02-14 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-19163.

   Resolution: Fixed
Fix Version/s: 2.2.0

> Lazy creation of the _judf
> --
>
> Key: SPARK-19163
> URL: https://issues.apache.org/jira/browse/SPARK-19163
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
>Reporter: Maciej Szymkiewicz
> Fix For: 2.2.0
>
>
> Current state
> Right now {{UserDefinedFunction}} eagerly creates {{_judf}} and initializes 
> {{SparkSession}} 
> (https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L1832)
>  as a side effect. This behavior may have undesired results when {{udf}} is 
> imported from a module:
> {{myudfs.py}}
> {code}
> from pyspark.sql.functions import udf
> from pyspark.sql.types import IntegerType
> 
> def _add_one(x):
> """Adds one"""
> if x is not None:
> return x + 1
> 
> add_one = udf(_add_one, IntegerType())
> {code}
> 
> 
> Example session:
> {code}
> In [1]: from pyspark.sql import SparkSession
> In [2]: from myudfs import add_one
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/01/07 19:55:44 WARN Utils: Your hostname, xxx resolves to a loopback 
> address: 127.0.1.1; using xxx instead (on interface eth0)
> 17/01/07 19:55:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> In [3]: spark = SparkSession.builder.appName("foo").getOrCreate()
> In [4]: spark.sparkContext.appName
> Out[4]: 'pyspark-shell'
> {code}
> Proposed
> Delay {{_judf}} initialization until the first call.
> {code}
> In [1]: from pyspark.sql import SparkSession
> In [2]: from myudfs import add_one
> In [3]: spark = SparkSession.builder.appName("foo").getOrCreate()
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 17/01/07 19:58:38 WARN Utils: Your hostname, xxx resolves to a loopback 
> address: 127.0.1.1; using xxx instead (on interface eth0)
> 17/01/07 19:58:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> In [4]: spark.sparkContext.appName
> Out[4]: 'foo'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2017-02-14 Thread Shea Parkes (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866393#comment-15866393
 ] 

Shea Parkes commented on SPARK-18541:
-

Thank you very much!

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Assignee: Shea Parkes
>Priority: Minor
>  Labels: newbie
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed

2017-02-14 Thread Armin Braun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866367#comment-15866367
 ] 

Armin Braun commented on SPARK-19592:
-

[~srowen] 

{quote}
What about tests that make their own conf or need to?
{quote}

Those tests in particular made me interested in this for 
correctness/readability reasons. Maybe an example helps :)

In _org.apache.spark.streaming.InputStreamsSuite_ we have the situation that 
the conf is set up in the parent suite via just

{code}
  val conf = new SparkConf()
.setMaster(master)
.setAppName(framework)
{code}

Now if you run that suite from the IDE, one of the tests fails with an apparent 
error in the logic.

{code}
The code passed to eventually never returned normally. Attempted 664 times over 
10.01260721901 seconds. Last failure message: 10 did not equal 5.
{code}

You debug it and find out that it's because a _StreamingListener_ gets 
added to the context twice: the test adds one manually that is already 
on the context.
The reason is that it is also added by the UI when you have 
_spark.ui.enabled_ left at its default of _true_.

So basically you now have a seemingly redundant line of code in a bunch of 
tests:

{code}
ssc.addStreamingListener(ssc.progressListener)
{code}

...  that appears wrong given the configuration you see if you just read 
the code, and that requires you to also consider (and maintain) what Maven or SBT is 
injecting into the environment.
---

So I think the tests that make their own config are the most troublesome, 
since they have non-standard defaults injected. In my opinion it would be a lot 
easier to work with if a newly created SparkConf simply carried the standard 
production defaults, and all deviation from that were explicit in the code.

I agree it's not a big pain, but it is still a quality issue worth fixing (IMO). 
It reduces maintenance effort through DRYer test configs and makes tests easier 
to read, as in the example above.

> Duplication in Test Configuration Relating to SparkConf Settings Should be 
> Removed
> --
>
> Key: SPARK-19592
> URL: https://issues.apache.org/jira/browse/SPARK-19592
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0, 2.2.0
> Environment: Applies to all Environments
>Reporter: Armin Braun
>Priority: Minor
>
> This configuration for Surefire, Scalatest is duplicated in the parent POM as 
> well as the SBT build.
> While this duplication cannot be removed in general it can at least be 
> removed for all system properties that simply result in a SparkConf setting I 
> think.
> Instead of having lines like 
> {code}
> <spark.ui.enabled>false</spark.ui.enabled>
> {code}
> twice in the pom.xml
> and once in SBT as
> {code}
> javaOptions in Test += "-Dspark.ui.enabled=false",
> {code}
> it would be a lot cleaner to simply have a 
> {code}
> var conf: SparkConf 
> {code}
> field in 
> {code}
> org.apache.spark.SparkFunSuite
> {code}
>  that has SparkConf set up with all the shared configuration that 
> `systemProperties` currently provide. Obviously this cannot be done straight 
> away given that
> many subclasses of the parent suite do this, so I think it would be best to 
> simply add a method to the parent that provides this configuration for now
> and start refactoring away duplication in other suite setups from there step 
> by step until the sys properties can be removed from the pom.xml and build.sbt.
> This makes the build a lot easier to maintain and makes tests more readable 
> by making the environment setup more explicit in the code.
> (also it would allow running more tests straight from the IDE which is always 
> a nice thing imo)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19552) Upgrade Netty version to 4.1.8 final

2017-02-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-19552.
--
Resolution: Later

> Upgrade Netty version to 4.1.8 final
> 
>
> Key: SPARK-19552
> URL: https://issues.apache.org/jira/browse/SPARK-19552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.1.0
>Reporter: Adam Roberts
>Priority: Minor
>
> Netty 4.1.8 was recently released but isn't API compatible with previous 
> major versions (like Netty 4.0.x), see 
> http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details.
> This version does include a fix for a security concern but not one we'd be 
> exposed to with Spark "out of the box". Let's upgrade the version we use to 
> be on the safe side as the security fix I'm especially interested in is not 
> available in the 4.0.x release line. 
> We should move up anyway to take on a bunch of other bug fixes cited in the 
> release notes (and if anyone were to use Spark with netty and tcnative, they 
> shouldn't be exposed to the security problem) - we should be good citizens 
> and make this change.
> As this 4.1 version involves API changes, we'll need to implement a few 
> methods and possibly adjust the Sasl tests. This JIRA and the associated pull 
> request start the process, which I'll work on - and any help would be much 
> appreciated! Currently I know:
> {code}
> @Override
> public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise 
> promise)
>   throws Exception {
>   if (!foundEncryptionHandler) {
> foundEncryptionHandler =
>   ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this 
> returns false and causes test failures
>   }
>   ctx.write(msg, promise);
> }
> {code}
> Here's what changes will be required (at least):
> {code}
> common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code}
>  requires touch, retain and transferred methods
> {code}
> common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code}
>  requires the above methods too
> {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code}
> With "dummy" implementations so we can at least compile and test, we'll see 
> five new test failures to address.
> These are
> {code}
> org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption
> org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption
> org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption
> org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed

2017-02-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866311#comment-15866311
 ] 

Sean Owen commented on SPARK-19592:
---

What about tests that make their own conf or need to?
I don't think this is a significant pain to bother with.

> Duplication in Test Configuration Relating to SparkConf Settings Should be 
> Removed
> --
>
> Key: SPARK-19592
> URL: https://issues.apache.org/jira/browse/SPARK-19592
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.1.0, 2.2.0
> Environment: Applies to all Environments
>Reporter: Armin Braun
>Priority: Minor
>
> This configuration for Surefire, Scalatest is duplicated in the parent POM as 
> well as the SBT build.
> While this duplication cannot be removed in general it can at least be 
> removed for all system properties that simply result in a SparkConf setting I 
> think.
> Instead of having lines like 
> {code}
> <spark.ui.enabled>false</spark.ui.enabled>
> {code}
> twice in the pom.xml
> and once in SBT as
> {code}
> javaOptions in Test += "-Dspark.ui.enabled=false",
> {code}
> it would be a lot cleaner to simply have a 
> {code}
> var conf: SparkConf 
> {code}
> field in 
> {code}
> org.apache.spark.SparkFunSuite
> {code}
>  that has SparkConf set up with all the shared configuration that 
> `systemProperties` currently provide. Obviously this cannot be done straight 
> away given that
> many subclasses of the parent suite do this, so I think it would be best to 
> simply add a method to the parent that provides this configuration for now
> and start refactoring away duplication in other suite setups from there step 
> by step until the sys properties can be removed from the pom.xml and build.sbt.
> This makes the build a lot easier to maintain and makes tests more readable 
> by making the environment setup more explicit in the code.
> (also it would allow running more tests straight from the IDE which is always 
> a nice thing imo)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19582) DataFrameReader conceptually inadequate

2017-02-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19582.
---
Resolution: Invalid

I don't understand what this is describing. Is it a dependency conflict? If so, 
what? You say DataFrameReader understands every possible source, but of course 
it doesn't. It is also not designed to exclude a particular data source, of 
course.

You can already supply a bunch of strings to make a Dataset of strings. 

Is this about Minio? what does 'forward' mean?

Rather than reply, please continue with more clarity on the mailing list. This 
isn't clear or specific enough for a JIRA.
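
On the "supply a bunch of strings" point, a minimal sketch of what is already possible today, assuming the application itself fetched the content (for example from minio) and only needs Spark to parse it:

{code}
// Minimal sketch: hand already-fetched text to Spark without giving it a path or URI.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("inline-data").getOrCreate()
import spark.implicits._

// Pretend these strings were read by the application, not by Spark/Hadoop.
val jsonLines = Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""")

val ds = jsonLines.toDS()         // Dataset[String] built from in-memory data
val df = spark.read.json(ds.rdd)  // parse the strings as JSON records
df.show()
{code}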

> DataFrameReader conceptually inadequate
> ---
>
> Key: SPARK-19582
> URL: https://issues.apache.org/jira/browse/SPARK-19582
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.0
>Reporter: James Q. Arnold
>
> DataFrameReader assumes it "understands" all data sources (local file system, 
> object stores, jdbc, ...).  This seems limiting in the long term, imposing 
> both development costs to accept new sources and dependency issues for 
> existing sources (how to coordinate the XX jar for internal use vs. the XX 
> jar used by the application).  Unless I have missed how this can be done 
> currently, an application with an unsupported data source cannot create the 
> required RDD for distribution.
> I recommend at least providing a text API for supplying data.  Let the 
> application provide data as a String (or char[] or ...)---not a path, but the 
> actual data.  Alternatively, provide interfaces or abstract classes the 
> application could provide to let the application handle external data 
> sources, without forcing all that complication into the Spark implementation.
> I don't have any code to submit, but JIRA seemed like to most appropriate 
> place to raise the issue.
> Finally, if I have overlooked how this can be done with the current API, a 
> new example would be appreciated.
> Additional detail...
> We use the minio object store, which provides an API compatible with AWS-S3.  
> A few configuration/parameter values differ for minio, but one can use the 
> AWS library in the application to connect to the minio server.
> When trying to use minio objects through spark, the s3://xxx paths are 
> intercepted by spark and handed to hadoop.  So far, I have been unable to 
> find the right combination of configuration values and parameters to 
> "convince" hadoop to forward the right information to work with minio.  If I 
> could read the minio object in the application, and then hand the object 
> contents directly to spark, I could bypass hadoop and solve the problem.  
> Unfortunately, the underlying spark design prevents that.  So, I see two 
> problems.
> -  Spark seems to have taken on the responsibility of "knowing" the API 
> details of all data sources.  This seems iffy in the long run (and is the 
> root of my current problem).  In the long run, it seems unwise to assume that 
> spark should understand all possible path names, protocols, etc.  Moreover, 
> passing S3 paths to hadoop seems a little odd (why not go directly to AWS, 
> for example).  This particular confusion about S3 shows the difficulties that 
> are bound to occur.
> -  Second, spark appears not to have a way to bypass the path name 
> interpretation.  At the least, spark could provide a text/blob interface, 
> letting the application supply the data object and avoid path interpretation 
> inside spark.  Alternatively, spark could accept a reader/stream/... to build 
> the object, again letting the application provide the implementation of the 
> object input.
> As I mentioned above, I might be missing something in the API that lets us 
> work around the problem.  I'll keep looking, but the API as apparently 
> structured seems too limiting.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866295#comment-15866295
 ] 

Timothy Hunter commented on SPARK-14523:


Also, the correlation is missing the multivariate case.

I will take this task over unless someone expresses interest.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis test and correlation computation may still need to expose 
> independent interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866288#comment-15866288
 ] 

Timothy Hunter commented on SPARK-4591:
---

[~josephkb] do you also want some subtasks for KernelDensity and multivariate 
summaries? They are in the stat module but not covered.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2017-02-14 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-18541:
---

Assignee: Shea Parkes

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Assignee: Shea Parkes
>Priority: Minor
>  Labels: newbie
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API

2017-02-14 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-18541.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management 
> in pyspark SQL API
> 
>
> Key: SPARK-18541
> URL: https://issues.apache.org/jira/browse/SPARK-18541
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.0.2
> Environment: all
>Reporter: Shea Parkes
>Assignee: Shea Parkes
>Priority: Minor
>  Labels: newbie
> Fix For: 2.2.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Scala SQL API, you can pass in new metadata when you alias a field.  
> That functionality is not available in the Python API.   Right now, you have 
> to painfully utilize {{SparkSession.createDataFrame}} to manipulate the 
> metadata for even a single column.  I would propose to add the following 
> method to {{pyspark.sql.Column}}:
> {code}
> def aliasWithMetadata(self, name, metadata):
> """
> Make a new Column that has the provided alias and metadata.
> Metadata will be processed with json.dumps()
> """
> _context = pyspark.SparkContext._active_spark_context
> _metadata_str = json.dumps(metadata)
> _metadata_jvm = 
> _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str)
> _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm)
> return Column(_new_java_column)
> {code}
> I can likely complete this request myself if there is any interest for it.  
> Just have to dust off my knowledge of doctest and the location of the python 
> tests.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join

2017-02-14 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866262#comment-15866262
 ] 

Nick Dimiduk commented on SPARK-13219:
--

I would implement this manually by materializing the smaller relation in the 
driver and then transforming those values into a filter applied to the larger. 
Frankly, I expected this to be the meaning of a broadcast join. I'm wondering if I'm 
doing something to prevent the planner from performing this optimization, so 
maybe the mailing list is a more appropriate place to discuss?
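
A rough sketch of that manual approach (names and data are illustrative; `tenants` stands for the small relation and `assets` for the large one):

{code}
// Hypothetical sketch: collect the small side's join keys on the driver and apply them
// as an IN filter on the large side, so its scan is pruned instead of being a full scan.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("manual-pushdown").getOrCreate()
import spark.implicits._

val tenants = Seq((1201, "tenant-a")).toDF("assetid", "name")                   // small
val assets  = Seq((1201, "addr-1"), (42, "addr-2")).toDF("assetid", "address")  // large

val keys = tenants.select("assetid").as[Int].collect().toSeq
val prunedAssets = assets.filter(col("assetid").isin(keys: _*))

prunedAssets.join(tenants, "assetid").select("name").show()
{code}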

> Pushdown predicate propagation in SparkSQL with join
> 
>
> Key: SPARK-13219
> URL: https://issues.apache.org/jira/browse/SPARK-13219
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.0
> Environment: Spark 1.4
> Datastax Spark connector 1.4
> Cassandra. 2.1.12
> Centos 6.6
>Reporter: Abhinav Chawade
>
> When 2 or more tables are joined in SparkSQL and there is an equality clause 
> in the query on attributes used to perform the join, it is useful to apply that 
> clause to the scans of both tables. If this is not done, one of the tables 
> results in a full scan, which can slow down the query dramatically. Consider the 
> following example with 2 tables being joined.
> {code}
> CREATE TABLE assets (
> assetid int PRIMARY KEY,
> address text,
> propertyname text
> )
> CREATE TABLE tenants (
> assetid int PRIMARY KEY,
> name text
> )
> spark-sql> explain select t.name from tenants t, assets a where a.assetid = 
> t.assetid and t.assetid='1201';
> WARN  2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to 
> load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> == Physical Plan ==
> Project [name#14]
>  ShuffledHashJoin [assetid#13], [assetid#15], BuildRight
>   Exchange (HashPartitioning 200)
>Filter (CAST(assetid#13, DoubleType) = 1201.0)
> HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, 
> Some(t)), None
>   Exchange (HashPartitioning 200)
>HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), 
> None
> Time taken: 1.354 seconds, Fetched 8 row(s)
> {code}
> The simple workaround is to add another equality condition for each table, but 
> it becomes cumbersome. It would be helpful if the query planner could improve 
> filter propagation.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12957) Derive and propagate data constrains in logical plan

2017-02-14 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866239#comment-15866239
 ] 

Nick Dimiduk commented on SPARK-12957:
--

[~sameerag] thanks for the comment. From a naive scan of the tickets, I believe 
I am seeing the benefits of SPARK-13871 in that an {{IsNotNull}} constraint is 
applied from the names of the join columns. However, I don't see the boon of 
SPARK-13789, specifically the {{a = 5, a = b}} mentioned in the description. My 
query is a join between a very small relation (100's of rows) and a very large 
one (10's of billions). I've hinted the planner to broadcast the smaller table, 
which it honors. After SPARK-13789, I expected to see the join column values 
pushed down as well. This is not the case.

Any tips on debugging this further? I've set breakpoints in the 
{{RelationProvider}} implementation and see that it's only receiving the 
{{IsNotNull}} filters, nothing further from the planner.

Thanks a lot!
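
One way to see exactly which predicates reach the source is a stub relation that just prints the filters handed to it; a minimal sketch (the class and column names are made up):

{code}
// Hypothetical debugging aid: a stub PrunedFilteredScan relation that logs the filters
// the planner pushes down, to check whether anything beyond IsNotNull arrives.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class FilterLoggingRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(StructField("assetid", IntegerType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    filters.foreach(f => println(s"pushed-down filter: $f"))  // inspect what was pushed
    sqlContext.sparkContext.emptyRDD[Row]
  }
}
{code}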

> Derive and propagate data constrains in logical plan 
> -
>
> Key: SPARK-12957
> URL: https://issues.apache.org/jira/browse/SPARK-12957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
> Attachments: ConstraintPropagationinSparkSQL.pdf
>
>
> Based on the semantics of a query plan, we can derive data constraints (e.g. if 
> a filter defines {{a > 10}}, we know that the output data of this filter 
> satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a 
> framework to derive and propagate constraints in the logical plan, which can 
> help us build more advanced optimizations.
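
A toy sketch of the example in the description, with stand-in expression classes (not Catalyst's real ones):

{code}
// Stand-in expression classes, only to illustrate the constraint-derivation idea.
sealed trait Expr
case class Attr(name: String) extends Expr
case class IntLit(value: Int) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr

// From a filter condition like `a > 10` we can derive both `a > 10` and `a IS NOT NULL`
// as constraints that hold for the filter's output rows.
def deriveConstraints(condition: Expr): Set[Expr] = condition match {
  case gt @ GreaterThan(a: Attr, _) => Set(gt, IsNotNull(a))
  case other                        => Set(other)
}

// deriveConstraints(GreaterThan(Attr("a"), IntLit(10)))
//   == Set(GreaterThan(Attr("a"), IntLit(10)), IsNotNull(Attr("a")))
{code}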



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19162) UserDefinedFunction constructor should verify that func is callable

2017-02-14 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19162.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> UserDefinedFunction constructor should verify that func is callable
> ---
>
> Key: SPARK-19162
> URL: https://issues.apache.org/jira/browse/SPARK-19162
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 2.2.0
>
>
> Current state
> Right now `UserDefinedFunctions` don't perform any input type validation. They 
> will accept non-callable objects just to fail with a hard-to-understand 
> traceback:
> {code}
> In [1]: from pyspark.sql.functions import udf
> In [2]: df = spark.range(0, 1)
> In [3]: f = udf(None)
> In [4]: df.select(f()).first()
> 17/01/07 19:30:50 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> ...
> Py4JJavaError: An error occurred while calling o51.collectToPython.
> ...
> TypeError: 'NoneType' object is not callable
> ...
> {code}
> Proposed
> Apply basic validation for {{func}} argument:
> {code}
> In [7]: udf(None)
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 udf(None)
> ...
> TypeError: func should be a callable object (a function or an instance of a 
> class with __call__). Got 
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


