[jira] [Created] (SPARK-19605) Fail fast if existing resources are not enough to run a streaming job

Genmao Yu created SPARK-19605:
    Summary: Fail fast if existing resources are not enough to run a streaming job
        Key: SPARK-19605
        URL: https://issues.apache.org/jira/browse/SPARK-19605
    Project: Spark
 Issue Type: Improvement
 Components: DStreams
Affects Versions: 2.1.0, 2.0.2
   Reporter: Genmao Yu
   Priority: Minor

--
This message was sent by Atlassian JIRA (v6.3.15#6346)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists

[ https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867409#comment-15867409 ]

Eyal Zituny commented on SPARK-19594:

Sure, I can do that. Would it make sense to fix it by marking the query id as terminated instead of removing it from the list?

> StreamingQueryListener fails to handle QueryTerminatedEvent if more than one listener exists
>
>                 Key: SPARK-19594
>                 URL: https://issues.apache.org/jira/browse/SPARK-19594
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Eyal Zituny
>            Priority: Minor
>
> To reproduce:
> * create a Spark session
> * add multiple streaming query listeners
> * create a simple query
> * stop the query
> Result: only the first listener handles the QueryTerminatedEvent.
> This likely happens because the query run id is removed from
> activeQueryRunIds once onQueryTerminated is called
> (StreamingQueryListenerBus:115)
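The bug pattern described above can be reproduced in a minimal, language-neutral sketch (plain Python, not Spark's actual Scala code): the bus checks membership of the run id per listener, but the termination handler removes the id from the active set as soon as the first listener has been notified, so later listeners never see the event.

```python
class ListenerBus:
    def __init__(self):
        self.listeners = []
        self.active_run_ids = {"run-1"}

    def post_terminated(self, run_id):
        for listener in self.listeners:
            if run_id in self.active_run_ids:        # per-listener guard
                listener.on_query_terminated(run_id)
                # bug: id removed after the *first* listener, so the guard
                # fails for every remaining listener
                self.active_run_ids.discard(run_id)

class CountingListener:
    def __init__(self):
        self.terminated = 0
    def on_query_terminated(self, run_id):
        self.terminated += 1

bus = ListenerBus()
a, b = CountingListener(), CountingListener()
bus.listeners.extend([a, b])
bus.post_terminated("run-1")
print(a.terminated, b.terminated)  # 1 0 -- only the first listener is notified
```

Marking the id as terminated (as suggested in the comment), or removing it only after the loop over all listeners completes, fixes the sketch.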
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation

[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867398#comment-15867398 ]

Song Jun commented on SPARK-19598:

OK~ I'd like to do this. Thank you very much!

> Remove the alias parameter in UnresolvedRelation
>
>                 Key: SPARK-19598
>                 URL: https://issues.apache.org/jira/browse/SPARK-19598
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>
> UnresolvedRelation has a second argument named "alias", for assigning the
> relation an alias. I think we can actually remove it and replace its use with
> a SubqueryAlias.
> This would actually simplify some analyzer code to only match on
> SubqueryAlias. For example, the broadcast hint pull request can have one
> fewer case: https://github.com/apache/spark/pull/16925/files
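The refactoring the issue proposes can be illustrated with a toy sketch (plain Python, not Spark's actual analyzer classes): instead of every relation node carrying its own optional alias field, aliasing is expressed by wrapping the node in a single SubqueryAlias, so analyzer rules only need to match one node type.

```python
from dataclasses import dataclass

@dataclass
class UnresolvedRelation:
    table: str               # no alias field any more

@dataclass
class SubqueryAlias:
    alias: str
    child: object            # any plan node can be aliased the same way

def visible_name(plan):
    # One match on SubqueryAlias covers aliasing for every node type,
    # instead of every rule special-casing an alias parameter.
    if isinstance(plan, SubqueryAlias):
        return plan.alias
    if isinstance(plan, UnresolvedRelation):
        return plan.table
    raise TypeError(plan)

t = UnresolvedRelation("events")
print(visible_name(t))                      # events
print(visible_name(SubqueryAlias("e", t)))  # e
```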
[jira] [Commented] (SPARK-16742) Kerberos support for Spark on Mesos

[ https://issues.apache.org/jira/browse/SPARK-16742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867374#comment-15867374 ]

Abel Rincón commented on SPARK-16742:

Hi all, we are working on a solution based on Hadoop delegation tokens and without proxy users. I hope you can take a look at the new code today.

> Kerberos support for Spark on Mesos
>
>                 Key: SPARK-16742
>                 URL: https://issues.apache.org/jira/browse/SPARK-16742
>             Project: Spark
>          Issue Type: New Feature
>          Components: Mesos
>            Reporter: Michael Gummelt
>
> We at Mesosphere have written Kerberos support for Spark on Mesos. We'll be
> contributing it to Apache Spark soon.
> Mesosphere design doc:
> https://docs.google.com/document/d/1xyzICg7SIaugCEcB4w1vBWp24UDkyJ1Pyt2jtnREFqc/edit#heading=h.tdnq7wilqrj6
> Mesosphere code:
> https://github.com/mesosphere/spark/commit/73ba2ab8d97510d5475ef9a48c673ce34f7173fa
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867366#comment-15867366 ]

Yun Ni commented on SPARK-18392:

I agree with Seth. We need to first finish SPARK-18080 and SPARK-18450 before implementing new LSH subclasses. (SPARK-18454 could be a prerequisite as well, but I am not sure.)
[~merlin] Once the prerequisites are satisfied, you can take either BitSampling or SignRandomProjection and I can take the other one.

> LSH API, algorithm, and documentation follow-ups
>
>                 Key: SPARK-18392
>                 URL: https://issues.apache.org/jira/browse/SPARK-18392
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> This JIRA summarizes discussions from the initial LSH PR
> [https://github.com/apache/spark/pull/15148] as well as the follow-up for
> hash distance [https://github.com/apache/spark/pull/15800]. This will be
> broken into subtasks:
> * API changes (targeted for 2.1)
> * algorithmic fixes (targeted for 2.1)
> * documentation improvements (ideally 2.1, but could slip)
> The major issues we have mentioned are as follows:
> * OR vs AND amplification
> ** Need to make API flexible enough to support both types of amplification in
> the future
> ** Need to clarify which we support, including in each model function
> (transform, similarity join, neighbors)
> * Need to clarify which algorithms we have implemented, improve docs and
> references, and fix the algorithms if needed.
> These major issues are broken down into detailed issues below.
> h3. LSH abstraction
> * Rename {{outputDim}} to something indicative of OR-amplification.
> ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used
> in the future for AND amplification (Thanks [~mlnick]!)
> * transform
> ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This
> is the "raw" output of all hash functions, i.e., with no aggregation for
> amplification.
> ** Clarify meaning of output in terms of multiple hash functions and
> amplification.
> ** Note: We will _not_ worry about users using this output for dimensionality
> reduction; if anything, that use case can be explained in the User Guide.
> * Documentation
> ** Clarify terminology used everywhere
> *** hash function {{h_i}}: basic hash function without amplification
> *** hash value {{h_i(key)}}: output of a hash function
> *** compound hash function {{g = (h_0,h_1,...,h_{K-1})}}: hash function with
> AND-amplification using K base hash functions
> *** compound hash function value {{g(key)}}: vector-valued output
> *** hash table {{H = (g_0,g_1,...,g_{L-1})}}: hash function with
> OR-amplification using L compound hash functions
> *** hash table value {{H(key)}}: output of array of vectors
> *** This terminology is largely pulled from Wang et al.'s survey and the
> multi-probe LSH paper.
> ** Link clearly to documentation (Wikipedia or papers) which matches our
> terminology and what we implemented
> h3. RandomProjection (or P-Stable Distributions)
> * Rename {{RandomProjection}}
> ** Options include: {{ScalarRandomProjectionLSH}},
> {{BucketedRandomProjectionLSH}}, {{PStableLSH}}
> * API privacy
> ** Make randUnitVectors private
> * hashFunction
> ** Currently, this uses OR-amplification for single probing, as we intended.
> ** It does *not* do multiple probing, at least not in the sense of the
> original MPLSH paper. We should fix that or at least document its behavior.
> * Documentation
> ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
> ** Also link to the multi-probe LSH paper since that explains this method
> very clearly.
> ** Clarify hash function and distance metric
> h3. MinHash
> * Rename {{MinHash}} -> {{MinHashLSH}}
> * API privacy
> ** Make randCoefficients, numEntries private
> * hashDistance (used in approxNearestNeighbors)
> ** Update to use the average of indicators of hash collisions [SPARK-18334]
> ** See [Wikipedia |
> https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a
> reference
> h3. All references
> I'm just listing references I looked at.
> Papers
> * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
> * [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
> * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
> * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe
> LSH paper
> Wikipedia
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
> * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]
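The MinHash hashDistance change referenced above (use the average of indicators of hash collisions, per the "variant with many hash functions" on the Wikipedia MinHash page) can be sketched in a few lines. This is a self-contained illustration, not Spark's implementation; the prime and coefficients below are arbitrary example values.

```python
import random

PRIME = 2038074743  # any prime larger than the element universe works here

def minhash_signature(items, coeffs):
    # h_i(x) = (a_i * x + b_i) mod PRIME, minimized over the set's elements
    return [min((a * x + b) % PRIME for x in items) for a, b in coeffs]

def hash_distance(sig1, sig2):
    # average of collision indicators, mapped to a distance in [0, 1]
    collisions = sum(h1 == h2 for h1, h2 in zip(sig1, sig2))
    return 1.0 - collisions / len(sig1)

rng = random.Random(42)
coeffs = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(128)]

a = {1, 2, 3, 4, 5}
b = {1, 2, 3, 4, 6}          # Jaccard similarity 4/6, about 0.67
d = hash_distance(minhash_signature(a, coeffs), minhash_signature(b, coeffs))
# the collision fraction (1 - d) approximates the Jaccard similarity
print(round(1 - d, 2))
```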
[jira] [Comment Edited] (SPARK-19442) Unable to add column to the dataset using Dataset.withColumn() API

[ https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867357#comment-15867357 ]

Navya Krishnappa edited comment on SPARK-19442 at 2/15/17 7:04 AM:

Thank you [~hyukjin.kwon]. It is working as per my requirement. I could create a new column with blank values. :)

was (Author: navya krishnappa):
Thank you [~hyukjin.kwon]. It is satisfied my requirement. I could create a new column with blank values. :)

> Unable to add column to the dataset using Dataset.withColumn() API
>
>                 Key: SPARK-19442
>                 URL: https://issues.apache.org/jira/browse/SPARK-19442
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.0.2
>            Reporter: Navya Krishnappa
>
> When I create a new column using the Dataset.withColumn() API, an
> AnalysisException is thrown.
> Dataset.withColumn() call:
> Dataset.withColumn("newColumnName', new
> org.apache.spark.sql.Column("newColumnName").cast("int"));
> Stacktrace:
> cannot resolve '`NewColumn`' given input columns: [abc, xyz]
[jira] [Commented] (SPARK-19442) Unable to add column to the dataset using Dataset.withColumn() API

[ https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867357#comment-15867357 ]

Navya Krishnappa commented on SPARK-19442:

Thank you [~hyukjin.kwon]. It satisfied my requirement. I could create a new column with blank values. :)

> Unable to add column to the dataset using Dataset.withColumn() API
>
>                 Key: SPARK-19442
>                 URL: https://issues.apache.org/jira/browse/SPARK-19442
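The failure in this thread comes down to column *references* being resolved against the existing input columns, while *literals* need no resolution, which is why referencing the brand-new name cannot resolve but a literal (Spark's `functions.lit`) adds a blank column. A toy model of that distinction, in plain Python rather than Spark:

```python
class Col:
    """A column reference: must resolve against existing columns."""
    def __init__(self, name):
        self.name = name
    def eval(self, row):
        if self.name not in row:
            raise ValueError(
                f"cannot resolve '`{self.name}`' given input columns: {sorted(row)}")
        return row[self.name]

class Lit:
    """A literal: needs no resolution, so it can back a brand-new column."""
    def __init__(self, value):
        self.value = value
    def eval(self, row):
        return self.value

def with_column(rows, name, expr):
    return [{**row, name: expr.eval(row)} for row in rows]

rows = [{"abc": 1, "xyz": 2}]
print(with_column(rows, "blank", Lit("")))       # adds a blank column
try:
    with_column(rows, "NewColumn", Col("NewColumn"))
except ValueError as e:
    print(e)                                      # cannot resolve '`NewColumn`' ...
```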
[jira] [Assigned] (SPARK-19604) Log the start of every Python test

[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai reassigned SPARK-19604:
    Assignee: Yin Huai

> Log the start of every Python test
>
>                 Key: SPARK-19604
>                 URL: https://issues.apache.org/jira/browse/SPARK-19604
>             Project: Spark
>          Issue Type: Test
>          Components: Tests
>    Affects Versions: 2.1.0
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> Right now, we only have an info-level log after we finish the tests of a
> Python test file. We should also log the start of a test, so that if a test
> is hanging we can tell which test file is running.
[jira] [Comment Edited] (SPARK-19593) Records read per each kinesis transaction

[ https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867214#comment-15867214 ]

Sarath Chandra Jiguru edited comment on SPARK-19593 at 2/15/17 5:52 AM:

The type of the ticket is Question. In KinesisReceiver.scala, it is currently only possible to use the default values of KinesisClientLibConfiguration. Because of this, even though the stream is capable of serving the required read rate, the Kinesis Spark Streaming consumer is not able to use it. Don't you think it is highly desirable to allow Spark consumers to tweak these configurations?

was (Author: sarathjiguru):
See the Type of the ticket is question. In KinesisReceiver.scala, currently it is possible to use default values of KinesisClientLibConfiguration. Due to this, even the stream is capable of serving the required read rate, the kinesis spark streaming consumer is not able to use it. Don't you think, it is highly needed to allow spark consumers to tweak these configurations?

> Records read per each kinesis transaction
>
>                 Key: SPARK-19593
>                 URL: https://issues.apache.org/jira/browse/SPARK-19593
>             Project: Spark
>          Issue Type: Question
>          Components: DStreams
>    Affects Versions: 2.0.1
>            Reporter: Sarath Chandra Jiguru
>            Priority: Trivial
>
> The question is related to the Spark Streaming + Kinesis integration.
> Is there a way to provide a Kinesis consumer configuration, e.g. the number
> of records read per each transaction?
> Right now, even though I am eligible to read 2.8 GB/minute, I am restricted
> to reading only 100 MB/minute, as I am not able to increase the default
> number of records in each transaction.
> I have raised a question on Stack Overflow as well, please look into it:
> http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
> Kinesis stream setup:
> open shards: 24
> write rate: 440K/minute
> read rate: 1.42M/minute
> read byte rate: 85 MB/minute. I am allowed to read around 2.8 GB/minute
> (24 shards * 2 MB * 60 seconds)
> Reference:
> http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html
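The shard-throughput arithmetic quoted above checks out: with each open Kinesis shard supporting reads of up to 2 MB/s, the stream's per-minute read capacity is 24 * 2 MB * 60 s, and the observed 85 MB/minute is only a few percent of that, which is consistent with the consumer being capped by a fixed per-call record limit rather than by the stream itself.

```python
# Verify the capacity figure from the ticket: 24 shards * 2 MB/s * 60 s.
open_shards = 24
read_mb_per_shard_per_sec = 2
capacity_mb_per_min = open_shards * read_mb_per_shard_per_sec * 60
print(capacity_mb_per_min)            # 2880 MB/min, i.e. roughly 2.8 GB/min

# The reporter observes ~85 MB/min, far below capacity.
utilization = 85 / capacity_mb_per_min
print(round(utilization * 100, 1))    # 3.0 (percent of available capacity)
```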
[jira] [Commented] (SPARK-19442) Unable to add column to the dataset using Dataset.withColumn() API

[ https://issues.apache.org/jira/browse/SPARK-19442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867277#comment-15867277 ]

Hyukjin Kwon commented on SPARK-19442:

ping [~Navya Krishnappa], would this satisfy your demand?

> Unable to add column to the dataset using Dataset.withColumn() API
>
>                 Key: SPARK-19442
>                 URL: https://issues.apache.org/jira/browse/SPARK-19442
[jira] [Assigned] (SPARK-19604) Log the start of every Python test

[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19604:
    Assignee: Apache Spark

> Log the start of every Python test
>
>                 Key: SPARK-19604
>                 URL: https://issues.apache.org/jira/browse/SPARK-19604
[jira] [Commented] (SPARK-19593) Records read per each kinesis transaction

[ https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867214#comment-15867214 ]

Sarath Chandra Jiguru commented on SPARK-19593:

The type of the ticket is Question. In KinesisReceiver.scala, it is currently only possible to use the default values of KinesisClientLibConfiguration. Because of this, even though the stream is capable of serving the required read rate, the Kinesis Spark Streaming consumer is not able to use it. Don't you think it is highly desirable to allow Spark consumers to tweak these configurations?

> Records read per each kinesis transaction
>
>                 Key: SPARK-19593
>                 URL: https://issues.apache.org/jira/browse/SPARK-19593
[jira] [Commented] (SPARK-19604) Log the start of every Python test

[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867212#comment-15867212 ]

Apache Spark commented on SPARK-19604:

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/16935

> Log the start of every Python test
>
>                 Key: SPARK-19604
>                 URL: https://issues.apache.org/jira/browse/SPARK-19604
[jira] [Assigned] (SPARK-19604) Log the start of every Python test

[ https://issues.apache.org/jira/browse/SPARK-19604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19604:
    Assignee: (was: Apache Spark)

> Log the start of every Python test
>
>                 Key: SPARK-19604
>                 URL: https://issues.apache.org/jira/browse/SPARK-19604
[jira] [Created] (SPARK-19604) Log the start of every Python test

Yin Huai created SPARK-19604:
    Summary: Log the start of every Python test
        Key: SPARK-19604
        URL: https://issues.apache.org/jira/browse/SPARK-19604
    Project: Spark
 Issue Type: Test
 Components: Tests
Affects Versions: 2.1.0
   Reporter: Yin Huai

Right now, we only have an info-level log after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging we can tell which test file is running.
[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction

[ https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sarath Chandra Jiguru updated SPARK-19593:
    Description:
The question is related to the Spark Streaming + Kinesis integration.
Is there a way to provide a Kinesis consumer configuration, e.g. the number of records read per each transaction?
Right now, even though I am eligible to read 2.8 GB/minute, I am restricted to reading only 100 MB/minute, as I am not able to increase the default number of records in each transaction.
I have raised a question on Stack Overflow as well, please look into it:
http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming
Kinesis stream setup:
open shards: 24
write rate: 440K/minute
read rate: 1.42M/minute
read byte rate: 85 MB/minute. I am allowed to read around 2.8 GB/minute (24 shards * 2 MB * 60 seconds)
Reference:
http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html

    was:
(Same description, except the read rate was previously given as 1.42K/minute instead of 1.42M/minute.)

> Records read per each kinesis transaction
>
>                 Key: SPARK-19593
>                 URL: https://issues.apache.org/jira/browse/SPARK-19593
[jira] [Commented] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867161#comment-15867161 ]

Marcelo Vanzin commented on SPARK-19556:

We don't generally assign bugs. Leaving a message should be enough in case anyone else was also thinking about working on it.

> Broadcast data is not encrypted when I/O encryption is on
>
>                 Key: SPARK-19556
>                 URL: https://issues.apache.org/jira/browse/SPARK-19556
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Marcelo Vanzin
>
> {{TorrentBroadcast}} uses a couple of "back doors" into the block manager to
> write and read data:
> {code}
> if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true)) {
>   throw new SparkException(s"Failed to store $pieceId of $broadcastId in local BlockManager")
> }
> {code}
> {code}
> bm.getLocalBytes(pieceId) match {
>   case Some(block) =>
>     blocks(pid) = block
>     releaseLock(pieceId)
>   case None =>
>     bm.getRemoteBytes(pieceId) match {
>       case Some(b) =>
>         if (checksumEnabled) {
>           val sum = calcChecksum(b.chunks(0))
>           if (sum != checksums(pid)) {
>             throw new SparkException(s"corrupt remote block $pieceId of $broadcastId:" +
>               s" $sum != ${checksums(pid)}")
>           }
>         }
>         // We found the block from remote executors/driver's BlockManager, so put the block
>         // in this executor's BlockManager.
>         if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
>           throw new SparkException(
>             s"Failed to store $pieceId of $broadcastId in local BlockManager")
>         }
>         blocks(pid) = b
>       case None =>
>         throw new SparkException(s"Failed to get $pieceId of $broadcastId")
>     }
> }
> {code}
> The thing these block manager methods have in common is that they bypass the
> encryption code; so broadcast data is stored unencrypted in the block
> manager, causing unencrypted data to be written to disk if those blocks need
> to be evicted from memory.
> The correct fix here is actually not to change {{TorrentBroadcast}}, but to
> fix the block manager so that:
> - data stored in memory is not encrypted
> - data written to disk is encrypted
> This would simplify the code paths that use BlockManager / SerializerManager
> APIs (e.g. see SPARK-19520), but requires some tricky changes inside the
> BlockManager to still be able to use file channels to avoid reading whole
> blocks back into memory so they can be decrypted.
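The fix direction described (plaintext in memory, ciphertext on disk, with callers like TorrentBroadcast needing no special handling) can be illustrated with a toy block manager. This is illustrative only, not Spark's BlockManager; a repeating-key XOR stands in for a real cipher.

```python
class ToyBlockManager:
    def __init__(self, memory_limit_blocks, key):
        self.key = key
        self.memory = {}                 # block id -> plaintext bytes
        self.disk = {}                   # block id -> ciphertext bytes
        self.memory_limit_blocks = memory_limit_blocks

    def _xor(self, data):
        # placeholder "cipher": XOR with a keystream is self-inverse
        return bytes(b ^ self.key[i % len(self.key)] for i, b in enumerate(data))

    def put_bytes(self, block_id, data):
        # callers always hand over plaintext; encryption is the block
        # manager's problem, applied only on the disk path
        if len(self.memory) >= self.memory_limit_blocks:
            evict_id, plain = self.memory.popitem()
            self.disk[evict_id] = self._xor(plain)   # encrypt when spilled
        self.memory[block_id] = data

    def get_bytes(self, block_id):
        if block_id in self.memory:
            return self.memory[block_id]             # plaintext in memory
        return self._xor(self.disk[block_id])        # decrypt on disk read

bm = ToyBlockManager(memory_limit_blocks=1, key=b"0123456789abcdef")
bm.put_bytes("piece0", b"broadcast piece 0")
bm.put_bytes("piece1", b"broadcast piece 1")          # evicts piece0 to disk
assert bm.disk["piece0"] != b"broadcast piece 0"      # stored encrypted
print(bm.get_bytes("piece0"))                         # b'broadcast piece 0'
```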
[jira] [Commented] (SPARK-19556) Broadcast data is not encrypted when I/O encryption is on

[ https://issues.apache.org/jira/browse/SPARK-19556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867152#comment-15867152 ]

Genmao Yu commented on SPARK-19556:

[~vanzin] I am working on this, could you please assign it to me?

> Broadcast data is not encrypted when I/O encryption is on
>
>                 Key: SPARK-19556
>                 URL: https://issues.apache.org/jira/browse/SPARK-19556
[jira] [Commented] (SPARK-18113) Sending AskPermissionToCommitOutput failed, driver enter into task deadloop
[ https://issues.apache.org/jira/browse/SPARK-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867124#comment-15867124 ] xukun commented on SPARK-18113: --- [~aash] According my scenario and [https://github.com/palantir/spark/pull/94] code task 678.0 outputCommitCoordinator.canCommit will match CommitState(NO_AUTHORIZED_COMMITTER, _, Uncommitted) => CommitState(attemptNumber, System.nanoTime(), MidCommit) outputCommitCoordinator.commitDone match CommitState(existingCommitter, startTime, MidCommit) if attemptNumber == existingCommitter => CommitState(attemptNumber, startTime, Committed) task 678.1 outputCommitCoordinator.canCommit match CommitState(existingCommitter, _, Committed) then driver enter into task deadloop > Sending AskPermissionToCommitOutput failed, driver enter into task deadloop > --- > > Key: SPARK-18113 > URL: https://issues.apache.org/jira/browse/SPARK-18113 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.1 > Environment: # cat /etc/redhat-release > Red Hat Enterprise Linux Server release 7.2 (Maipo) >Reporter: xuqing >Assignee: jin xing > Fix For: 2.2.0 > > > Executor sends *AskPermissionToCommitOutput* to driver failed, and retry > another sending. Driver receives 2 AskPermissionToCommitOutput messages and > handles them. But executor ignores the first response(true) and receives the > second response(false). The TaskAttemptNumber for this partition in > authorizedCommittersByStage is locked forever. Driver enters into infinite > loop. > h4. Driver Log: > {noformat} > 16/10/25 05:38:28 INFO TaskSetManager: Starting task 24.0 in stage 2.0 (TID > 110, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/25 05:39:00 WARN TaskSetManager: Lost task 24.0 in stage 2.0 (TID 110, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 0 > ... 
> 16/10/25 05:39:00 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 0 > ... > 16/10/26 15:53:03 INFO TaskSetManager: Starting task 24.1 in stage 2.0 (TID > 119, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > 16/10/26 15:53:05 WARN TaskSetManager: Lost task 24.1 in stage 2.0 (TID 119, > cwss04.sh01.com): TaskCommitDenied (Driver denied task commit) for job: 2, > partition: 24, attemptNumber: 1 > 16/10/26 15:53:05 INFO OutputCommitCoordinator: Task was denied committing, > stage: 2, partition: 24, attempt: 1 > ... > 16/10/26 15:53:05 INFO TaskSetManager: Starting task 24.28654 in stage 2.0 > (TID 28733, cwss04.sh01.com, partition 24, PROCESS_LOCAL, 5248 bytes) > ... > {noformat} > h4. Executor Log: > {noformat} > ... > 16/10/25 05:38:42 INFO Executor: Running task 24.0 in stage 2.0 (TID 110) > ... > 16/10/25 05:39:10 WARN NettyRpcEndpointRef: Error sending message [message = > AskPermissionToCommitOutput(2,24,0)] in 1 attempts > org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 > seconds]. 
This timeout is controlled by spark.rpc.askTimeout > at > org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78) > at > org.apache.spark.scheduler.OutputCommitCoordinator.canCommit(OutputCommitCoordinator.scala:95) > at > org.apache.spark.mapred.SparkHadoopMapRedUtil$.commitTask(SparkHadoopMapRedUtil.scala:73) > at > org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:106) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1212) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:279) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at
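The commit-state transitions traced in the comment above can be modeled with a small sketch. This is a hypothetical simplification for illustration, not Spark's actual OutputCommitCoordinator code: once attempt 0's ask has been authorized and commitDone has moved the partition to Committed, every later attempt's canCommit is denied, so the stage keeps launching and denying new attempts.

```python
# Simplified model of the commit-state machine described in the comment above.
# Illustrative sketch only -- NOT Spark's actual OutputCommitCoordinator.
NO_AUTHORIZED_COMMITTER = -1

class CommitState:
    def __init__(self):
        self.committer = NO_AUTHORIZED_COMMITTER
        self.phase = "Uncommitted"   # Uncommitted -> MidCommit -> Committed

    def can_commit(self, attempt):
        if self.phase == "Uncommitted":
            # First asker becomes the authorized committer.
            self.committer, self.phase = attempt, "MidCommit"
            return True
        if self.phase == "MidCommit":
            # A duplicate ask from the same attempt (e.g. an RPC retry)
            # is still authorized; any other attempt is denied.
            return attempt == self.committer
        return False  # Committed: every later attempt is denied forever

    def commit_done(self, attempt):
        if self.phase == "MidCommit" and attempt == self.committer:
            self.phase = "Committed"

state = CommitState()
# Attempt 0 asks; the RPC response is lost, so the executor times out,
# the task is reported lost, and attempt 1 (then 2, 3, ...) is launched.
assert state.can_commit(0) is True       # first ask: authorized
state.commit_done(0)                     # driver records the commit as done
for attempt in range(1, 5):
    # Each retry is denied -> TaskCommitDenied -> another retry: the deadloop.
    assert state.can_commit(attempt) is False
```

The point of the sketch is the last branch: nothing ever resets the partition out of the Committed state for a new attempt, which matches the 28654 retries of task 24 in the driver log.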
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867064#comment-15867064 ] Mingjie Tang commented on SPARK-18392: -- Sure, AND-amp is important and basic for current LSH. We can allocate the efforts for AND-amplification and new hash functions. \cc [~yunn] > LSH API, algorithm, and documentation follow-ups > > > Key: SPARK-18392 > URL: https://issues.apache.org/jira/browse/SPARK-18392 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA summarizes discussions from the initial LSH PR > [https://github.com/apache/spark/pull/15148] as well as the follow-up for > hash distance [https://github.com/apache/spark/pull/15800]. This will be > broken into subtasks: > * API changes (targeted for 2.1) > * algorithmic fixes (targeted for 2.1) > * documentation improvements (ideally 2.1, but could slip) > The major issues we have mentioned are as follows: > * OR vs AND amplification > ** Need to make API flexible enough to support both types of amplification in > the future > ** Need to clarify which we support, including in each model function > (transform, similarity join, neighbors) > * Need to clarify which algorithms we have implemented, improve docs and > references, and fix the algorithms if needed. > These major issues are broken down into detailed issues below. > h3. LSH abstraction > * Rename {{outputDim}} to something indicative of OR-amplification. > ** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used > in the future for AND amplification (Thanks [~mlnick]!) > * transform > ** Update output schema to {{Array of Vector}} instead of {{Vector}}. This > is the "raw" output of all hash functions, i.e., with no aggregation for > amplification. > ** Clarify meaning of output in terms of multiple hash functions and > amplification. 
> ** Note: We will _not_ worry about users using this output for dimensionality > reduction; if anything, that use case can be explained in the User Guide. > * Documentation > ** Clarify terminology used everywhere > *** hash function {{h_i}}: basic hash function without amplification > *** hash value {{h_i(key)}}: output of a hash function > *** compound hash function {{g = (h_0,h_1,...h_{K-1})}}: hash function with > AND-amplification using K base hash functions > *** compound hash function value {{g(key)}}: vector-valued output > *** hash table {{H = (g_0,g_1,...g_{L-1})}}: hash function with > OR-amplification using L compound hash functions > *** hash table value {{H(key)}}: output of array of vectors > *** This terminology is largely pulled from Wang et al.'s survey and the > multi-probe LSH paper. > ** Link clearly to documentation (Wikipedia or papers) which matches our > terminology and what we implemented > h3. RandomProjection (or P-Stable Distributions) > * Rename {{RandomProjection}} > ** Options include: {{ScalarRandomProjectionLSH}}, > {{BucketedRandomProjectionLSH}}, {{PStableLSH}} > * API privacy > ** Make randUnitVectors private > * hashFunction > ** Currently, this uses OR-amplification for single probing, as we intended. > ** It does *not* do multiple probing, at least not in the sense of the > original MPLSH paper. We should fix that or at least document its behavior. > * Documentation > ** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia > ** Also link to the multi-probe LSH paper since that explains this method > very clearly. > ** Clarify hash function and distance metric > h3. 
MinHash > * Rename {{MinHash}} -> {{MinHashLSH}} > * API privacy > ** Make randCoefficients, numEntries private > * hashDistance (used in approxNearestNeighbors) > ** Update to use average of indicators of hash collisions [SPARK-18334] > ** See [Wikipedia | > https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a > reference > h3. All references > I'm just listing references I looked at. > Papers > * [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf] > * [https://people.csail.mit.edu/indyk/p117-andoni.pdf] > * [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf] > * [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe > LSH paper > Wikipedia > * > [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search] > * [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
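The OR- vs AND-amplification terminology above can be made concrete with a toy MinHash sketch. This is hypothetical illustration code, not the spark.ml implementation; the hash form {{(a*x + b) mod p}} and the parameter values are assumptions for the example. With L hash tables (OR) of K hash functions each (AND), a pair is a candidate iff all K values collide in at least one of the L tables:

```python
import random

# Toy MinHash with AND/OR amplification, illustrating the terminology above.
# Hypothetical sketch -- not the spark.ml MinHashLSH implementation.
PRIME = 2147483647  # a Mersenne prime, used for the universal hash family

def make_hash(rng):
    a, b = rng.randrange(1, PRIME), rng.randrange(0, PRIME)
    return lambda x: (a * x + b) % PRIME

def minhash(h, s):
    """Basic hash function h_i applied to a set: min over its elements."""
    return min(h(x) for x in s)

rng = random.Random(42)
L, K = 3, 4  # L hash tables (OR-amplification), K functions per table (AND)
# Hash table H = (g_0, ..., g_{L-1}); each compound function g = (h_0, ..., h_{K-1}).
tables = [[make_hash(rng) for _ in range(K)] for _ in range(L)]

def signature(s):
    """H(key): an array of L vectors, each being the K-dim value g_j(key)."""
    return [tuple(minhash(h, s) for h in g) for g in tables]

def candidate_pair(s1, s2):
    # OR over tables, AND within a table: candidate iff ALL K values
    # collide in AT LEAST ONE of the L tables.
    return any(g1 == g2 for g1, g2 in zip(signature(s1), signature(s2)))

a = {1, 2, 3, 4, 5}
assert candidate_pair(a, a)  # identical sets collide in every table
```

In this vocabulary, spark.ml's {{numHashTables}} would be L; a future {{numHashFunctions}} for AND-amplification would be K, with the current implementation effectively fixing K = 1.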
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867049#comment-15867049 ] Seth Hendrickson commented on SPARK-18392: -- I would pretty strongly prefer to focus on adding AND-amplification before adding anything else to LSH. That is more of a missing part of the functionality, whereas the other items are enhancements. Curious to hear others' thoughts on this. > LSH API, algorithm, and documentation follow-ups >
[jira] [Commented] (SPARK-18392) LSH API, algorithm, and documentation follow-ups
[ https://issues.apache.org/jira/browse/SPARK-18392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867041#comment-15867041 ] mingjie tang commented on SPARK-18392: -- [~yunn] Are you working on the BitSampling & SignRandomProjection functions? If not, I can work on them this week. > LSH API, algorithm, and documentation follow-ups >
[jira] [Commented] (SPARK-19603) Fix StreamingQuery explain command
[ https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867037#comment-15867037 ] Apache Spark commented on SPARK-19603: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16934 > Fix StreamingQuery explain command > -- > > Key: SPARK-19603 > URL: https://issues.apache.org/jira/browse/SPARK-19603 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.0.2, 2.1.0 > Reporter: Shixiong Zhu > Assignee: Shixiong Zhu > > Right now StreamingQuery.explain doesn't show the correct streaming physical plan.
[jira] [Assigned] (SPARK-19603) Fix StreamingQuery explain command
[ https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19603: Assignee: Apache Spark (was: Shixiong Zhu) > Fix StreamingQuery explain command >
[jira] [Assigned] (SPARK-19603) Fix StreamingQuery explain command
[ https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19603: Assignee: Shixiong Zhu (was: Apache Spark) > Fix StreamingQuery explain command >
[jira] [Updated] (SPARK-19603) Fix StreamingQuery explain command
[ https://issues.apache.org/jira/browse/SPARK-19603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19603: - Summary: Fix StreamingQuery explain command (was: Fix the stream explain command) > Fix StreamingQuery explain command >
[jira] [Created] (SPARK-19603) Fix the stream explain command
Shixiong Zhu created SPARK-19603: Summary: Fix the stream explain command Key: SPARK-19603 URL: https://issues.apache.org/jira/browse/SPARK-19603 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Shixiong Zhu Assignee: Shixiong Zhu Right now StreamingQuery.explain doesn't show the correct streaming physical plan.
[jira] [Commented] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867024#comment-15867024 ] Sunitha Kambhampati commented on SPARK-19602: - Attaching the design doc and the proposed solution. I'd appreciate any feedback/suggestions. Also, I will look into submitting a PR soon. > Unable to query using the fully qualified column name of the form ( ..) > -- > > Key: SPARK-19602 > URL: https://issues.apache.org/jira/browse/SPARK-19602 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.1.0 > Reporter: Sunitha Kambhampati > Attachments: Design_ColResolution_JIRA19602.docx > > > 1) Spark SQL fails to analyze this query: select db1.t1.i1 from db1.t1, db2.t1 > Most of the other database systems support this (e.g. DB2, Oracle, MySQL). > Note: In DB2 and Oracle, the notion is of .. > 2) Another scenario where this fully qualified name is useful is as follows: > // current database is db1. > select t1.i1 from t1, db2.t1 > If the i1 column exists in both tables, db1.t1 and db2.t1, this will throw an error during column resolution in the analyzer, as the reference is ambiguous. > Let's say the user intended to retrieve i1 from db1.t1, but in the example only db2.t1 has an i1 column. The query would then succeed instead of throwing an error. > One way to avoid confusion is to explicitly specify the fully qualified name db1.t1.i1, e.g.: select db1.t1.i1 from t1, db2.t1 > Workarounds: > There is a workaround for these issues, which is to use an alias.
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Attachment: Design_ColResolution_JIRA19602.docx > Unable to query using the fully qualified column name of the form ( ..) >
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Description: > Unable to query using the fully qualified column name of the form ( ..) >
[jira] [Updated] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
[ https://issues.apache.org/jira/browse/SPARK-19602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sunitha Kambhampati updated SPARK-19602: Description: > Unable to query using the fully qualified column name of the form ( ..) >
[jira] [Created] (SPARK-19602) Unable to query using the fully qualified column name of the form ( ..)
Sunitha Kambhampati created SPARK-19602: --- Summary: Unable to query using the fully qualified column name of the form ( ..) Key: SPARK-19602 URL: https://issues.apache.org/jira/browse/SPARK-19602 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Sunitha Kambhampati
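The ambiguity SPARK-19602 describes can be sketched with a toy resolver. This is hypothetical illustration code, unrelated to Spark's actual analyzer: with two t1 tables in scope, the two-part name t1.i1 is ambiguous, while the three-part name db1.t1.i1 resolves uniquely.

```python
# Toy column resolver illustrating the ambiguity described above.
# Hypothetical sketch -- not Spark's analyzer.
catalog = {
    ("db1", "t1"): ["i1", "i2"],
    ("db2", "t1"): ["i1"],
}

def resolve(name, in_scope):
    """Resolve 'col', 'table.col', or 'db.table.col' against the tables in scope."""
    parts = name.split(".")
    if len(parts) == 3:                      # db.table.col: fully qualified
        db, tbl, col = parts
        matches = [(d, t) for (d, t) in in_scope
                   if (d, t) == (db, tbl) and col in catalog[(d, t)]]
    elif len(parts) == 2:                    # table.col: database left implicit
        tbl, col = parts
        matches = [(d, t) for (d, t) in in_scope
                   if t == tbl and col in catalog[(d, t)]]
    else:                                    # bare column name
        col = parts[0]
        matches = [(d, t) for (d, t) in in_scope if col in catalog[(d, t)]]
    if len(matches) != 1:
        raise ValueError(f"{name}: {'ambiguous' if matches else 'not found'}")
    return matches[0]

scope = [("db1", "t1"), ("db2", "t1")]      # FROM db1.t1, db2.t1
assert resolve("db1.t1.i1", scope) == ("db1", "t1")   # unique: three-part name
try:
    resolve("t1.i1", scope)                  # both tables have i1 -> ambiguous
except ValueError as e:
    assert "ambiguous" in str(e)
```

The design question in the JIRA is essentially whether the analyzer's name resolution should accept the three-part branch above; today it stops at two parts, which is why aliasing is the workaround.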
[jira] [Resolved] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-14894. -- Resolution: Duplicate [~wangmiao1981], I guess we can also act on the JIRA directly if we are quite confident, per http://spark.apache.org/contributing.html {quote} Most contributors are able to directly resolve JIRAs. Use judgment in determining whether you are quite confident the issue should be resolved, although changes can be easily undone. {quote} > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark > Reporter: Joseph K. Bradley > Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API should provide the same functionality.
[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab
[ https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15867000#comment-15867000 ] Ruslan Dautkhanov commented on SPARK-19588: --- Got it. Thanks [~vanzin] > Allow putting keytab file to HDFS location specified in spark.yarn.keytab > - > > Key: SPARK-19588 > URL: https://issues.apache.org/jira/browse/SPARK-19588 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 2.0.2, 2.1.0 > Environment: kerberized cluster, Spark 2 >Reporter: Ruslan Dautkhanov > Labels: authentication, kerberos, security, yarn-client > > As a workaround for SPARK-19038 tried putting keytab in user's home directory > in HDFS but this fails with > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Keytab file: > hdfs:///user/svc_odiprd/.kt does not exist > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} > This is yarn-client mode, so the driver probably can't see HDFS while submitting > a job; although I suspect this failure is not limited to yarn-client. > It would be great to support reading the keytab for kerberos ticket renewals > directly from HDFS. > We think that in some scenarios it's more secure than referencing a keytab > from a local fs on a client machine that does a spark-submit. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14894) Python GaussianMixture summary
[ https://issues.apache.org/jira/browse/SPARK-14894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866999#comment-15866999 ] Miao Wang commented on SPARK-14894: --- This is a duplicate of SPARK-18282. It should be closed. > Python GaussianMixture summary > -- > > Key: SPARK-14894 > URL: https://issues.apache.org/jira/browse/SPARK-14894 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.ml, GaussianMixture includes a result summary. The Python API > should provide the same functionality. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19528) external shuffle service would close while still have request from executor when dynamic allocation is enabled
[ https://issues.apache.org/jira/browse/SPARK-19528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866962#comment-15866962 ] satheessh commented on SPARK-19528: --- I am also getting same error from container " ERROR client.TransportResponseHandler: Still have 1 requests outstanding when connection from" Node manager Logs: at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, chunkIndex=79}, buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/23/shuffle_1_46802_0.data, offset=59855197, length=32076}} to /172.20.96.35:39880; closing connection java.nio.channels.ClosedChannelException at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, chunkIndex=80}, buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/23/shuffle_1_48408_0.data, offset=61388120, length=38889}} to /172.20.96.35:39880; closing connection java.nio.channels.ClosedChannelException at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, chunkIndex=81}, 
buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/3d/shuffle_1_48518_0.data, offset=61130291, length=35265}} to /172.20.96.35:39880; closing connection java.nio.channels.ClosedChannelException at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, chunkIndex=82}, buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/2d/shuffle_1_48640_0.data, offset=77145914, length=61274}} to /172.20.96.35:39880; closing connection java.nio.channels.ClosedChannelException at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-21): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027031288, chunkIndex=83}, buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-3052fe16-feda-4555-8ee5-0b728c9ea738/15/shuffle_1_49038_0.data, offset=62340136, length=39151}} to /172.20.96.35:39880; closing connection java.nio.channels.ClosedChannelException at org.spark_project.io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) 2017-02-14 19:27:36,300 ERROR org.apache.spark.network.server.TransportRequestHandler (shuffle-server-2-17): Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1207027030512, chunkIndex=3}, buffer=FileSegmentManagedBuffer{file=/mnt/yarn/usercache/hadoop/appcache/application_1487039710840_0006/blockmgr-37dc54f8-1f23-4fc4-bf5e-d0b9a1837867/29/shuffle_1_95_0.data, 
offset=1442114, length=37869}} to /172.20.101.23:37866; closing connection java.io.IOException: Broken pipe at sun.nio.ch.FileChannelImpl.transferTo0(Native Method) at sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428) at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493) at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608) at org.spark_project.io.netty.channel.DefaultFileRegion.transferTo(DefaultFileRegion.java:139) at org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:121) at org.spark_project.io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:287) at org.spark_project.io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237) at org.spark_project.io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:314) at
[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab
[ https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866932#comment-15866932 ] Marcelo Vanzin commented on SPARK-19588: bq. driver/yarn#client holds keytab just to distribute it to executors, isn't it? Not currently, it actually tries to login with the keytab before submitting the application. I'm not a fan of that behavior (it confuses a lot of people who try to use it as a generic kerberos login for Spark, and run into issues because there seem to be a few outstanding issues with the code), but it's a little questionable whether it can be changed now (since some people might be relying on it). > Allow putting keytab file to HDFS location specified in spark.yarn.keytab > - > > Key: SPARK-19588 > URL: https://issues.apache.org/jira/browse/SPARK-19588 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 2.0.2, 2.1.0 > Environment: kerberized cluster, Spark 2 >Reporter: Ruslan Dautkhanov > Labels: authentication, kerberos, security, yarn-client > > As a workaround for SPARK-19038 tried putting keytab in user's home directory > in HDFS but this fails with > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Keytab file: > hdfs:///user/svc_odiprd/.kt does not exist > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} > This is yarn-client mode, so the driver probably can't see HDFS while submitting > a job; although I suspect this failure is not limited to yarn-client. > It would be great to support reading the keytab for kerberos ticket renewals > directly from HDFS. 
> We think that in some scenarios it's more secure than referencing a keytab > from a local fs on a client machine that does a spark-submit. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`
[ https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19318. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16891 [https://github.com/apache/spark/pull/16891] > Docker test case failure: `SPARK-16625: General data types to be mapped to > Oracle` > -- > > Key: SPARK-19318 > URL: https://issues.apache.org/jira/browse/SPARK-19318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li > Fix For: 2.2.0 > > > = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General > data types to be mapped to Oracle' = > - SPARK-16625: General data types to be mapped to Oracle *** FAILED *** > types.apply(9).equals("class java.sql.Date") was false > (OracleIntegrationSuite.scala:136) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19318) Docker test case failure: `SPARK-16625: General data types to be mapped to Oracle`
[ https://issues.apache.org/jira/browse/SPARK-19318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19318: --- Assignee: Suresh Thalamati > Docker test case failure: `SPARK-16625: General data types to be mapped to > Oracle` > -- > > Key: SPARK-19318 > URL: https://issues.apache.org/jira/browse/SPARK-19318 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Suresh Thalamati > Fix For: 2.2.0 > > > = FINISHED o.a.s.sql.jdbc.OracleIntegrationSuite: 'SPARK-16625: General > data types to be mapped to Oracle' = > - SPARK-16625: General data types to be mapped to Oracle *** FAILED *** > types.apply(9).equals("class java.sql.Date") was false > (OracleIntegrationSuite.scala:136) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19588) Allow putting keytab file to HDFS location specified in spark.yarn.keytab
[ https://issues.apache.org/jira/browse/SPARK-19588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866839#comment-15866839 ] Ruslan Dautkhanov commented on SPARK-19588: --- driver/yarn#client holds keytab just to distribute it to executors, isn't it? If so, then we don't need to "download" to local disk for driver/yarn#client as executors will already have access to keytab in HDFS. Am I missing something? > Allow putting keytab file to HDFS location specified in spark.yarn.keytab > - > > Key: SPARK-19588 > URL: https://issues.apache.org/jira/browse/SPARK-19588 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Spark Submit >Affects Versions: 2.0.2, 2.1.0 > Environment: kerberized cluster, Spark 2 >Reporter: Ruslan Dautkhanov > Labels: authentication, kerberos, security, yarn-client > > As a workaround for SPARK-19038 tried putting keytab in user's home directory > in HDFS but this fails with > {noformat} > Exception in thread "main" org.apache.spark.SparkException: Keytab file: > hdfs:///user/svc_odiprd/.kt does not exist > at > org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:555) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:158) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:124) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > {noformat} > This is yarn-client mode, so the driver probably can't see HDFS while submitting > a job; although I suspect this failure is not limited to yarn-client. > It would be great to support reading the keytab for kerberos ticket renewals > directly from HDFS. > We think that in some scenarios it's more secure than referencing a keytab > from a local fs on a client machine that does a spark-submit. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19275) Spark Streaming, Kafka receiver, "Failed to get records for ... after polling for 512"
[ https://issues.apache.org/jira/browse/SPARK-19275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun resolved SPARK-19275. - Resolution: Not A Problem > Spark Streaming, Kafka receiver, "Failed to get records for ... after polling > for 512" > -- > > Key: SPARK-19275 > URL: https://issues.apache.org/jira/browse/SPARK-19275 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.0 > Environment: Apache Spark 2.0.0, Kafka 0.10 for Scala 2.11 >Reporter: Dmitry Ochnev > > We have a Spark Streaming application reading records from Kafka 0.10. > Some tasks fail because of the following error: > "java.lang.AssertionError: assertion failed: Failed to get records for (...) > after polling for 512" > The first attempt fails and the second attempt (retry) completes > successfully; this is the pattern that we see for many tasks in our logs. > These failures and retries consume resources. > A similar case with a stack trace is described here: > https://www.mail-archive.com/user@spark.apache.org/msg56564.html > https://gist.github.com/SrikanthTati/c2e95c4ac689cd49aab817e24ec42767 > Here is the line from the stack trace where the error is raised: > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > We tried several values for "spark.streaming.kafka.consumer.poll.ms": 2, 5, > 10, 30 and 60 seconds, but the error appeared in all the cases except the > last one. Moreover, increasing the threshold led to an increase in total Spark > stage duration. > In other words, increasing "spark.streaming.kafka.consumer.poll.ms" led to > fewer task failures but at the cost of total stage duration. So, it is bad for > performance when processing data streams. > We have a suspicion that there is a bug in CachedKafkaConsumer (and/or other > related classes) which inhibits the reading process. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16475) Broadcast Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-16475: --- Assignee: Reynold Xin > Broadcast Hint for SQL Queries > -- > > Key: SPARK-16475 > URL: https://issues.apache.org/jira/browse/SPARK-16475 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: BroadcastHintinSparkSQL.pdf > > > A broadcast hint is a way for users to manually annotate a query and suggest to > the query optimizer the join method. It is very useful when the query > optimizer cannot make an optimal decision with respect to join methods due to > conservativeness or the lack of proper statistics. > The DataFrame API has had a broadcast hint since Spark 1.5. However, we do not have > equivalent functionality in SQL queries. We propose adding a Hive-style > broadcast hint to Spark SQL. > For more information, please see the attached document. One note about the > doc: in addition to supporting "MAPJOIN", we should also support > "BROADCASTJOIN" and "BROADCAST" in the comment, e.g. the following should be > accepted: > {code} > SELECT /*+ MAPJOIN(b) */ ... > SELECT /*+ BROADCASTJOIN(b) */ ... > SELECT /*+ BROADCAST(b) */ ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16475) Broadcast Hint for SQL Queries
[ https://issues.apache.org/jira/browse/SPARK-16475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-16475. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16925 [https://github.com/apache/spark/pull/16925] > Broadcast Hint for SQL Queries > -- > > Key: SPARK-16475 > URL: https://issues.apache.org/jira/browse/SPARK-16475 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin > Labels: releasenotes > Fix For: 2.2.0 > > Attachments: BroadcastHintinSparkSQL.pdf > > > A broadcast hint is a way for users to manually annotate a query and suggest to > the query optimizer the join method. It is very useful when the query > optimizer cannot make an optimal decision with respect to join methods due to > conservativeness or the lack of proper statistics. > The DataFrame API has had a broadcast hint since Spark 1.5. However, we do not have > equivalent functionality in SQL queries. We propose adding a Hive-style > broadcast hint to Spark SQL. > For more information, please see the attached document. One note about the > doc: in addition to supporting "MAPJOIN", we should also support > "BROADCASTJOIN" and "BROADCAST" in the comment, e.g. the following should be > accepted: > {code} > SELECT /*+ MAPJOIN(b) */ ... > SELECT /*+ BROADCASTJOIN(b) */ ... > SELECT /*+ BROADCAST(b) */ ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction
[ https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19593: - Priority: Trivial (was: Critical) > Records read per each kinesis transaction > - > > Key: SPARK-19593 > URL: https://issues.apache.org/jira/browse/SPARK-19593 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.0.1 >Reporter: Sarath Chandra Jiguru >Priority: Trivial > > The question is related to spark streaming+kinesis integration > Is there a way to provide a kinesis consumer configuration. Ex: Number of > records read per each transaction etc. > Right now, even though, I am eligible to read 2.8G/minute, I am restricted to > read only 100MB/minute, as I am not able to increase the default number of > records in each transaction. > I have raised a question in stackoverflow as well, please look into it: > http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming > Kinesis stream setup: > open shards: 24 > write rate: 440K/minute > read rate: 1.42K/minute > read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 > Shards*2 MB*60 Seconds) > Reference: > http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
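The read ceiling quoted in the stream setup above follows from simple arithmetic, assuming the standard Kinesis limit of 2 MB per second of reads per shard:

```python
# Kinesis read limit: 2 MB per second per shard.
shards = 24
per_shard_mb_per_sec = 2
allowed_mb_per_min = shards * per_shard_mb_per_sec * 60
print(allowed_mb_per_min)        # 2880 MB/min, i.e. the ~2.8 GB/minute quoted above

observed_mb_per_min = 85         # the reported read byte rate
utilization = observed_mb_per_min / allowed_mb_per_min
print(f"{utilization:.1%}")      # 3.0% of the allowed read throughput
```

So the reporter is using only about 3% of the provisioned read capacity, which is why tuning the consumer's records-per-request setting matters here.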
[jira] [Updated] (SPARK-19593) Records read per each kinesis transaction
[ https://issues.apache.org/jira/browse/SPARK-19593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19593: - Component/s: (was: Structured Streaming) (was: Spark Core) DStreams > Records read per each kinesis transaction > - > > Key: SPARK-19593 > URL: https://issues.apache.org/jira/browse/SPARK-19593 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.0.1 >Reporter: Sarath Chandra Jiguru >Priority: Critical > > The question is related to spark streaming+kinesis integration > Is there a way to provide a kinesis consumer configuration. Ex: Number of > records read per each transaction etc. > Right now, even though, I am eligible to read 2.8G/minute, I am restricted to > read only 100MB/minute, as I am not able to increase the default number of > records in each transaction. > I have raised a question in stackoverflow as well, please look into it: > http://stackoverflow.com/questions/42107037/how-to-alter-kinesis-consumer-properties-in-spark-streaming > Kinesis stream setup: > open shards: 24 > write rate: 440K/minute > read rate: 1.42K/minute > read byte rate: 85 MB/minute. I am allowed to read around 2.8G/minute(24 > Shards*2 MB*60 Seconds) > Reference: > http://docs.aws.amazon.com/streams/latest/dev/kinesis-record-processor-additional-considerations.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19594) StreamingQueryListener fails to handle QueryTerminatedEvent if more then one listeners exists
[ https://issues.apache.org/jira/browse/SPARK-19594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866778#comment-15866778 ] Shixiong Zhu commented on SPARK-19594: -- Good catch. Would you like to submit a PR to fix it? > StreamingQueryListener fails to handle QueryTerminatedEvent if more then one > listeners exists > - > > Key: SPARK-19594 > URL: https://issues.apache.org/jira/browse/SPARK-19594 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Eyal Zituny >Priority: Minor > > reproduce: > *create a spark session > *add multiple streaming query listeners > *create a simple query > *stop the query > result -> only the first listener handle the QueryTerminatedEvent > this might happen because the query run id is being removed from > activeQueryRunIds once the onQueryTerminated is called > (StreamingQueryListenerBus:115) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19387) CRAN tests do not run with SparkR source package
[ https://issues.apache.org/jira/browse/SPARK-19387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-19387. --- Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 Issue resolved by pull request 16720 [https://github.com/apache/spark/pull/16720] > CRAN tests do not run with SparkR source package > > > Key: SPARK-19387 > URL: https://issues.apache.org/jira/browse/SPARK-19387 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Felix Cheung > Fix For: 2.1.1, 2.2.0 > > > It looks like sparkR.session() is not installing Spark - as a result, running > R CMD check --as-cran SparkR_*.tar.gz fails, blocking possible submission to > CRAN. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866768#comment-15866768 ] Timothy Hunter commented on SPARK-19208: Yes, I meant returning a struct and then projecting this struct. I do not think there is any other way right now with the current UDAFs, as you mention. In that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, the fields of which are decided by the arguments in {{.metrics}}, and each of the individual functions {{VectorSummarizer.min/max/variance(...)}} returns columns of vectors or matrices. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
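The OOM in the example above is consistent with a back-of-the-envelope estimate. Assuming dense storage at 8 bytes per element (true for the Array[Double] and Array[Long] fields listed in the description, ignoring JVM object overhead), the eight per-feature arrays need roughly:

```python
features = 29_890_095    # from the kdd2010 dataset quoted above
bytes_per_element = 8    # a Double or a Long
per_feature_arrays = 8   # currMean, currM2n, currM2, currL1, weightSum, nnz, currMax, currMin

total_gib = features * bytes_per_element * per_feature_arrays / 2**30
print(round(total_gib, 2))   # ~1.78 GiB, well above the 1G executor heap in the example

# MaxAbsScaler only needs one such array (max of absolute values):
print(round(features * bytes_per_element / 2**30, 2))  # ~0.22 GiB
```

This is why computing only the requested statistics (one array for MaxAbsScaler, three for MinMaxScaler) lets the example fit in memory.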
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ] Nick Pentreath commented on SPARK-19208: Ah right I see - yes rewrite rules would be a good ultimate goal. One question I have - if we do: {{code}} val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features")) {{code}} How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple return cols are not supported by UDAF? Or do we have to live with the struct return type for now until we could do the rewrite version? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866755#comment-15866755 ] Nick Pentreath edited comment on SPARK-19208 at 2/14/17 9:42 PM: - Ah right I see - yes rewrite rules would be a good ultimate goal. One question I have - if we do: {code} val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features")) {code} How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple return cols are not supported by UDAF? Or do we have to live with the struct return type for now until we could do the rewrite version? was (Author: mlnick): Ah right I see - yes rewrite rules would be a good ultimate goal. One question I have - if we do: {{code}} val summary = df.select(VectorSummarizer.metrics("min", "max").summary("features")) {{code}} How will we return a DF with cols {{min}} and {{max}}? Since it seems multiple return cols are not supported by UDAF? Or do we have to live with the struct return type for now until we could do the rewrite version? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19523: - Issue Type: Question (was: Improvement) > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Question > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a 
lot of trash directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19523) Spark streaming+ insert into table leaves bunch of trash in table directory
[ https://issues.apache.org/jira/browse/SPARK-19523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19523. -- Resolution: Not A Bug > Spark streaming+ insert into table leaves bunch of trash in table directory > --- > > Key: SPARK-19523 > URL: https://issues.apache.org/jira/browse/SPARK-19523 > Project: Spark > Issue Type: Improvement > Components: DStreams, SQL >Affects Versions: 2.0.2 >Reporter: Egor Pahomov >Priority: Minor > > I have very simple code, which transform coming json files into pq table: > {code} > import org.apache.spark.sql.hive.HiveContext > import org.apache.hadoop.fs.Path > import org.apache.hadoop.io.{LongWritable, Text} > import org.apache.hadoop.mapreduce.lib.input.TextInputFormat > import org.apache.spark.sql.SaveMode > object Client_log { > def main(args: Array[String]): Unit = { > val resultCols = new HiveContext(Spark.ssc.sparkContext).sql(s"select * > from temp.x_streaming where year=2015 and month=12 and day=1").dtypes > var columns = resultCols.filter(x => > !Commons.stopColumns.contains(x._1)).map({ case (name, types) => { > s"""cast (get_json_object(s, '""" + '$' + s""".properties.${name}') as > ${Commons.mapType(types)}) as $name""" > } > }) > columns ++= List("'streaming' as sourcefrom") > def f(path:Path): Boolean = { > true > } > val client_log_d_stream = Spark.ssc.fileStream[LongWritable, Text, > TextInputFormat]("/user/egor/test2", f _ , newFilesOnly = false) > client_log_d_stream.foreachRDD(rdd => { > val localHiveContext = new HiveContext(rdd.sparkContext) > import localHiveContext.implicits._ > var input = rdd.map(x => Record(x._2.toString)).toDF() > input = input.selectExpr(columns: _*) > input = > SmallOperators.populate(input, resultCols) > input > .write > .mode(SaveMode.Append) > .format("parquet") > .insertInto("temp.x_streaming") > }) > Spark.ssc.start() > Spark.ssc.awaitTermination() > } > case class Record(s: String) > } > {code} > This code generates a lot of trash 
directories in resalt table like: > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_298_7130707897870357017-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-00_309_6225285476054854579-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_305_2185311414031328806-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-06_309_6331022557673464922-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_334_1333065569942957405-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-12_387_3622176537686712754-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_339_1008134657443203932-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-18_421_3284019142681396277-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_291_5985064758831763168-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-24_300_6751765745457248879-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_314_2987765230093671316-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-30_331_2746678721907502111-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_311_1466065813702202959-1 > drwxrwxrwt 3 egor nobody 4096 Feb 8 14:15 > .hive-staging_hive_2017-02-08_14-15-36_317_7079974647544197072-1 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19497) dropDuplicates with watermark
[ https://issues.apache.org/jira/browse/SPARK-19497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu reassigned SPARK-19497:
------------------------------------

    Assignee: Shixiong Zhu

> dropDuplicates with watermark
> -----------------------------
>
>                 Key: SPARK-19497
>                 URL: https://issues.apache.org/jira/browse/SPARK-19497
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.1.0
>            Reporter: Michael Armbrust
>            Assignee: Shixiong Zhu
>            Priority: Critical
>

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714 ] Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM: - Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("vector_summary(features).min").as("min(features)"), col("vector_summary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? was (Author: timhunter): Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. 
What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), col("VectorSummary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714 ] Timothy Hunter commented on SPARK-19208: Thanks for the clarification [~mlnick]. I was a bit unclear in my previous comment. What I meant by catalyst rules is supporting the case in which the user would naturally request multiple summaries: {code} val summaryDF = df.select(VectorSummary.min("features"), VectorSummary.variance("features")) {code} and have a simple rule that rewrites this logical tree to use a single UDAF under the hood: {code} val tmpDF = df.select(VectorSummary.summary("features", "min", "variance")) val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), col("VectorSummary(features).variance").as("variance(features)") {code} Of course this is more advanced, and we should probably start with: - a UDAF that follows some builder pattern such as VectorSummarizer.metrics("min", "max").summary("features") - some simple wrappers that (inefficiently) compute independently their statistics: {{VectorSummarizer.min("feature")}} is a shortcut for: {code} VectorSummarizer.metrics("min").summary("features").getCol("min") {code} etc. We can always optimize this use case later using rewrite rules. What do you think? > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
[ https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19601: Assignee: Apache Spark (was: Xiao Li) > Fix CollapseRepartition rule to preserve shuffle-enabled Repartition > > > Key: SPARK-19601 > URL: https://issues.apache.org/jira/browse/SPARK-19601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Apache Spark > > When users use the shuffle-enabled `repartition` API, they expect the > partition they got should be the exact number they provided, even if they > call shuffle-disabled `coalesce` later. Currently, `CollapseRepartition` rule > does not consider whether shuffle is enabled or not. Thus, we got the > following unexpected result. > {noformat} > val df = spark.range(0, 1, 1, 5) > val df2 = df.repartition(10) > assert(df2.coalesce(13).rdd.getNumPartitions == 5) > assert(df2.coalesce(7).rdd.getNumPartitions == 5) > assert(df2.coalesce(3).rdd.getNumPartitions == 3) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
[ https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866709#comment-15866709 ] Apache Spark commented on SPARK-19601: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/16933 > Fix CollapseRepartition rule to preserve shuffle-enabled Repartition > > > Key: SPARK-19601 > URL: https://issues.apache.org/jira/browse/SPARK-19601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > When users use the shuffle-enabled `repartition` API, they expect the > partition they got should be the exact number they provided, even if they > call shuffle-disabled `coalesce` later. Currently, `CollapseRepartition` rule > does not consider whether shuffle is enabled or not. Thus, we got the > following unexpected result. > {noformat} > val df = spark.range(0, 1, 1, 5) > val df2 = df.repartition(10) > assert(df2.coalesce(13).rdd.getNumPartitions == 5) > assert(df2.coalesce(7).rdd.getNumPartitions == 5) > assert(df2.coalesce(3).rdd.getNumPartitions == 3) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
[ https://issues.apache.org/jira/browse/SPARK-19601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19601: Assignee: Xiao Li (was: Apache Spark) > Fix CollapseRepartition rule to preserve shuffle-enabled Repartition > > > Key: SPARK-19601 > URL: https://issues.apache.org/jira/browse/SPARK-19601 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > When users use the shuffle-enabled `repartition` API, they expect the > partition they got should be the exact number they provided, even if they > call shuffle-disabled `coalesce` later. Currently, `CollapseRepartition` rule > does not consider whether shuffle is enabled or not. Thus, we got the > following unexpected result. > {noformat} > val df = spark.range(0, 1, 1, 5) > val df2 = df.repartition(10) > assert(df2.coalesce(13).rdd.getNumPartitions == 5) > assert(df2.coalesce(7).rdd.getNumPartitions == 5) > assert(df2.coalesce(3).rdd.getNumPartitions == 3) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19601) Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
Xiao Li created SPARK-19601:
-------------------------------

             Summary: Fix CollapseRepartition rule to preserve shuffle-enabled Repartition
                 Key: SPARK-19601
                 URL: https://issues.apache.org/jira/browse/SPARK-19601
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0, 2.0.2
            Reporter: Xiao Li
            Assignee: Xiao Li

When users use the shuffle-enabled `repartition` API, they expect the partition they got should be the exact number they provided, even if they call shuffle-disabled `coalesce` later. Currently, `CollapseRepartition` rule does not consider whether shuffle is enabled or not. Thus, we got the following unexpected result.

{noformat}
val df = spark.range(0, 1, 1, 5)
val df2 = df.repartition(10)
assert(df2.coalesce(13).rdd.getNumPartitions == 5)
assert(df2.coalesce(7).rdd.getNumPartitions == 5)
assert(df2.coalesce(3).rdd.getNumPartitions == 3)
{noformat}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
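The intended post-fix behavior can be sketched with a toy plan model (hypothetical classes, not Spark's actual `Repartition` logical node): a shuffle-enabled repartition sitting beneath a coalesce must survive collapsing, because coalesce can only reduce whatever partition count its child produces.

```python
# Toy logical-plan sketch showing why CollapseRepartition must preserve a
# shuffle-enabled Repartition: repartition(n) always shuffles to exactly n
# partitions, while coalesce(n) can only reduce the child's partition count.
from dataclasses import dataclass

@dataclass
class Source:
    num_partitions: int

@dataclass
class Repartition:
    num_partitions: int
    shuffle: bool      # True for repartition(), False for coalesce()
    child: object

def collapse(plan):
    """Collapse nested Repartition nodes, preserving shuffle-enabled inner ones."""
    if isinstance(plan, Repartition) and isinstance(plan.child, Repartition):
        inner = plan.child
        if plan.shuffle or not inner.shuffle:
            # The outer shuffle dominates, or both are coalesces:
            # the inner node is redundant and can be dropped.
            return collapse(Repartition(plan.num_partitions, plan.shuffle, inner.child))
        # Outer coalesce over an inner shuffle: must NOT collapse (the bug above).
    return plan

def result_partitions(plan):
    if isinstance(plan, Source):
        return plan.num_partitions
    n = result_partitions(plan.child)
    return plan.num_partitions if plan.shuffle else min(plan.num_partitions, n)
```

Under this model, `repartition(10)` followed by `coalesce(13)` yields 10 partitions rather than falling back to the source's 5, matching the expectation stated in the issue.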
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866675#comment-15866675 ] Nick Pentreath commented on SPARK-19208: When I said "estimator-like", I didn't mean it should necessarily be an actual {{Estimator}} (I agree it is not really intended to fit into transformers & pipelines), but rather mimic the API, i.e. that the summarizer is "fitted" on a dataset to return a summary. I just wasn't too keen on the idea of returning a struct as it just feels sort of clunky relative to returning a df with vector columns {{"mean", "min", "max"}} etc. Supporting SS and {{groupBy}} seems like an important goal, so something like [~timhunter]'s suggestion looks like it will work nicely. For doing it via catalyst rules, that would be first prize to automatically re-use the intermediate results for multiple end-result computations, and only compute what is necessary for the required end-results. But, is that even supported for UDTs currently? I'm not an expert but my understanding was that is not supported yet. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After modication in the pr, the above example run successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
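The struct-return question raised in this thread (a UDAF can emit one struct column, which callers then flatten into top-level `min`/`max` columns, much as one would select a struct's individual fields) can be sketched with plain-Python rows standing in for a DataFrame — a conceptual stand-in, not Spark code:

```python
# Flatten a struct-valued column into top-level columns:
# [{'summary': {'min': ..., 'max': ...}}] -> [{'min': ..., 'max': ...}]
def flatten_struct(rows, struct_col):
    flat = []
    for row in rows:
        struct = row[struct_col]
        out = {k: v for k, v in row.items() if k != struct_col}  # keep other columns
        out.update(struct)                                       # promote struct fields
        flat.append(out)
    return flat
```

This is the "live with the struct return type" option: the single aggregation result stays intact, and the multi-column view is produced by a cheap projection afterwards.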
[jira] [Commented] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS
[ https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866544#comment-15866544 ] Sean Owen commented on SPARK-19600: --- You indicated it was the same issue, but, it's still not the right place to ask the question. > ArrayIndexOutOfBoundsException in ALS > - > > Key: SPARK-19600 > URL: https://issues.apache.org/jira/browse/SPARK-19600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 >Reporter: zhengxiang pan >Priority: Blocker > > Understand issue SPARK-3080 closed, but I don't understand yet what cause the > issue: memory, parallelism, negative userID or product ID? > I consistently ran into this issue with different set of training set, can > you suggest any area to look at? > java.lang.ArrayIndexOutOfBoundsException: 221529807 > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944) > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by 
Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535 ] Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM: - I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. - more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have an API like this: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. was (Author: timhunter): I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. 
- more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have a the following API: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. > MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. 
> For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For
[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization
[ https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535 ] Timothy Hunter commented on SPARK-19208: I am not sure if we should follow the Estimator API for classical statistics: - it does not transform the data, it only gets fitted, so it does not quite fit the Estimator API. - more generally, I would argue that the use case is to get some information about a dataframe for its own sake, rather than being part of a ML pipeline. For instance, there was no attempt to fit these algorithms into spark.mllib estimator/model API, and basic scalers are already in the transformer API. I want to second [~josephkb]'s API, because it is the most flexible with respect to implementation, and the only one that is compatible with structured streaming and groupBy. That means users will be able to use all the summary stats without additional work from us to retrofit the API to structured streaming. Furthermore, the exact implementation details (a single private UDAF, more optimized catalyst-based transforms) can be implemented in the future without changing the API. As an intermediate step, if introducing catalyst rules is too hard for now and if we want to address [~mlnick]'s points (a) and (b), we can have a the following API: {code} df.select(VectorSummary.summary("features", "min", "mean", ...) df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...) {code} or: {code} df.select(VectorSummary.summaryStats("min", "mean").summary("features") df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights") {code} What do you think? I will be happy to put together a proposal. 
> MultivariateOnlineSummarizer performance optimization > - > > Key: SPARK-19208 > URL: https://issues.apache.org/jira/browse/SPARK-19208 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: zhengruifeng > Attachments: Tests.pdf, WechatIMG2621.jpeg > > > Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using > {{MultivariateOnlineSummarizer}} to compute the min/max. > However {{MultivariateOnlineSummarizer}} will also compute extra unused > statistics. It slows down the task, moreover it is more prone to cause OOM. > For example: > env : --driver-memory 4G --executor-memory 1G --num-executors 4 > data: > [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)] > 748401 instances, and 29,890,095 features > {{MaxAbsScaler.fit}} fails because of OOM > {{MultivariateOnlineSummarizer}} maintains 8 arrays: > {code} > private var currMean: Array[Double] = _ > private var currM2n: Array[Double] = _ > private var currM2: Array[Double] = _ > private var currL1: Array[Double] = _ > private var totalCnt: Long = 0 > private var totalWeightSum: Double = 0.0 > private var weightSquareSum: Double = 0.0 > private var weightSum: Array[Double] = _ > private var nnz: Array[Long] = _ > private var currMax: Array[Double] = _ > private var currMin: Array[Double] = _ > {code} > For {{MaxAbsScaler}}, only 1 array is needed (max of abs value) > For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz) > After the modification in the PR, the above example runs successfully. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
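The reduction the report describes — a max-abs pass needs one array, not eight — can be sketched with plain JVM code. This is a hypothetical illustration of what the aggregation buffer could look like, not Spark's actual implementation:

```java
// Hypothetical sketch (not Spark's actual implementation): for MaxAbsScaler,
// the only per-feature state needed is the running element-wise max of
// absolute values, so the aggregation buffer can be one double[] instead of
// the eight arrays MultivariateOnlineSummarizer keeps.
public class MaxAbsAggregator {

    // Merge one dense vector into the running max-abs buffer.
    public static double[] update(double[] buffer, double[] v) {
        for (int i = 0; i < v.length; i++) {
            double abs = Math.abs(v[i]);
            if (abs > buffer[i]) buffer[i] = abs;
        }
        return buffer;
    }

    // Combine two partial buffers, as a treeAggregate-style combiner would.
    public static double[] merge(double[] a, double[] b) {
        for (int i = 0; i < a.length; i++) {
            if (b[i] > a[i]) a[i] = b[i];
        }
        return a;
    }
}
```

With 29,890,095 features this is a single 8-bytes-per-feature buffer per partition instead of seven arrays plus counters, which is the kind of difference that separates the OOM above from a successful fit.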
[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866523#comment-15866523 ] Herman van Hovell commented on SPARK-19518: --- Go for it. > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, > however it does not work in SQL statements as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866523#comment-15866523 ] Herman van Hovell edited comment on SPARK-19518 at 2/14/17 8:00 PM: Go for it! was (Author: hvanhovell): Go for it. > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, > however it does not work in SQL statements as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19501) Slow checking if there are many spark.yarn.jars, which are already on HDFS
[ https://issues.apache.org/jira/browse/SPARK-19501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19501. Resolution: Fixed Assignee: Jong Wook Kim Fix Version/s: 2.2.0 2.1.1 2.0.3 > Slow checking if there are many spark.yarn.jars, which are already on HDFS > -- > > Key: SPARK-19501 > URL: https://issues.apache.org/jira/browse/SPARK-19501 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Jong Wook Kim >Assignee: Jong Wook Kim >Priority: Minor > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > Hi, this is my first Spark issue submission and please excuse any > inconsistencies. > I am experiencing a slower application startup time when I specify many files > as {{spark.yarn.jars}}, by setting it as all JARs in an HDFS folder, such as > {{hdfs://namenode/user/spark/lib/*.jar}}. > Since the JAR files are already on the same HDFS that YARN is running, the > application should be very fast to startup. However, the delay is significant > especially when {{spark-submit}} is running from a non-local network, because > {{spark-yarn}} accesses the individual JAR files via HDFS, adding hundreds of > RTT before the application is ready. The official spark distribution with > Hadoop 2.7 has more than 200 jars, and >100 even if we exclude Hadoop and its > dependencies. > There are currently two HDFS RPC calls for each file, once at > {{ClientDistributedCacheManager.addResource}} calling {{fs.getFileStatus}}, > and another at {{yarn.Client.copyFileToRemote}} calling {{fc.resolvePath}}. I > suppose that both are unnecessary, since we [already retrieved all > FileStatuses, and that those are not > symlinks|https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531]. 
> To fix this, I suppose that we can modify {{addResource}} to use its > {{statCache}} variable before making an HDFS RPC and populate {{statCache}} > appropriately before calling {{addResource}}. Also, an optional boolean > parameter of {{copyFileToRemote}} can be added to skip the symlink check. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
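The statCache idea above can be sketched outside Hadoop. This is a minimal stand-in — {{FileStatus}} here is a simplified placeholder for Hadoop's class, and the real change would live in {{ClientDistributedCacheManager.addResource}}:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of the statCache idea: the cache is consulted before any
// remote call, so a bulk directory listing loaded up front means the
// per-file addResource-style lookups cost zero extra RPCs.
public class StatCachingClient {
    // Simplified stand-in for org.apache.hadoop.fs.FileStatus.
    public static final class FileStatus {
        public final String path;
        public final long len;
        public FileStatus(String path, long len) { this.path = path; this.len = len; }
    }

    private final Map<String, FileStatus> statCache = new HashMap<>();
    private final Function<String, FileStatus> remoteLookup; // stands in for one HDFS RPC
    public int rpcCalls = 0; // instrumentation for this sketch only

    public StatCachingClient(Function<String, FileStatus> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Populate the cache from a single bulk listing, as the reporter suggests.
    public void preload(Iterable<FileStatus> statuses) {
        for (FileStatus s : statuses) statCache.put(s.path, s);
    }

    // Each path costs at most one remote call, and none if preloaded.
    public FileStatus getFileStatus(String path) {
        return statCache.computeIfAbsent(path, p -> { rpcCalls++; return remoteLookup.apply(p); });
    }
}
```

For 200+ JARs over a high-latency link, turning two RPCs per file into zero is what removes the hundreds of RTTs described above.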
[jira] [Commented] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS
[ https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866516#comment-15866516 ] zhengxiang pan commented on SPARK-19600: Are you sure this is a duplicate of SPARK-3080, even without looking into it? > ArrayIndexOutOfBoundsException in ALS > - > > Key: SPARK-19600 > URL: https://issues.apache.org/jira/browse/SPARK-19600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 >Reporter: zhengxiang pan >Priority: Blocker > > Understand issue SPARK-3080 closed, but I don't understand yet what cause the > issue: memory, parallelism, negative userID or product ID? > I consistently ran into this issue with different set of training set, can > you suggest any area to look at? > java.lang.ArrayIndexOutOfBoundsException: 221529807 > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944) > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA 
(v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS
[ https://issues.apache.org/jira/browse/SPARK-19600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19600. --- Resolution: Duplicate Questions go to the mailing list. Don't open duplicate JIRAs to re-ask. > ArrayIndexOutOfBoundsException in ALS > - > > Key: SPARK-19600 > URL: https://issues.apache.org/jira/browse/SPARK-19600 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.1 >Reporter: zhengxiang pan >Priority: Blocker > > Understand issue SPARK-3080 closed, but I don't understand yet what cause the > issue: memory, parallelism, negative userID or product ID? > I consistently ran into this issue with different set of training set, can > you suggest any area to look at? > java.lang.ArrayIndexOutOfBoundsException: 221529807 > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944) > at > org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To 
unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19518) IGNORE NULLS in first_value / last_value should be supported in SQL statements
[ https://issues.apache.org/jira/browse/SPARK-19518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866461#comment-15866461 ] Sameer Abhyankar commented on SPARK-19518: -- I have come across this issue recently with Spark 2.0 and would like to work on this unless it is already assigned to someone else! > IGNORE NULLS in first_value / last_value should be supported in SQL statements > -- > > Key: SPARK-19518 > URL: https://issues.apache.org/jira/browse/SPARK-19518 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Ferenc Erdelyi > > https://issues.apache.org/jira/browse/SPARK-13049 was implemented in Spark2, > however it does not work in SQL statements as it is not implemented in Hive > yet: https://issues.apache.org/jira/browse/HIVE-11189 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866445#comment-15866445 ] Apache Spark commented on SPARK-19599: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/16932 > Clean up HDFSMetadataLog for Hadoop 2.6+ > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19599: Assignee: Apache Spark > Clean up HDFSMetadataLog for Hadoop 2.6+ > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19599: Assignee: (was: Apache Spark) > Clean up HDFSMetadataLog for Hadoop 2.6+ > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
[ https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-19529. Resolution: Fixed > TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() > --- > > Key: SPARK-19529 > URL: https://issues.apache.org/jira/browse/SPARK-19529 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > In Spark's Netty RPC layer, TransportClientFactory.createClient() calls > awaitUninterruptibly() on a Netty future while waiting for a connection to be > established. This creates problem when a Spark task is interrupted while > blocking in this call (which can happen in the event of a slow connection > which will eventually time out). This has bad impacts on task cancellation > when interruptOnCancel = true. > As an example of the impact of this problem, I experienced significant > numbers of uncancellable "zombie tasks" on a production cluster where several > tasks were blocked trying to connect to a dead shuffle server and then > continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack: > {code} > java.lang.Object.wait(Native Method) > java.lang.Object.wait(Object.java:460) > io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) > io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > => holding Monitor(java.lang.Object@1849476028) > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350) > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120) > > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45) > > org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169) > > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > [...] > {code} > I believe that we can easily fix this by using the > InterruptedException-throwing await() instead. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
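The proposed fix — an interruptible wait instead of {{awaitUninterruptibly()}} — can be illustrated with plain JDK primitives. This is not Netty's API; {{CountDownLatch}} stands in here for the connection future:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Illustration with JDK primitives, not Netty's actual API: an interruptible,
// bounded wait lets a cancelled task observe the interrupt (or a timeout)
// instead of blocking forever the way awaitUninterruptibly() can against a
// dead peer.
public class InterruptibleConnect {
    // Returns true once "connected"; throws InterruptedException if the
    // waiting task is cancelled, which is exactly the behavior the fix
    // relies on for interruptOnCancel = true.
    public static boolean awaitConnected(CountDownLatch connected, long timeoutMs)
            throws InterruptedException {
        return connected.await(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

With this shape, a cancelled task unwinds via InterruptedException rather than becoming a zombie stuck inside the stack shown above.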
[jira] [Created] (SPARK-19600) ArrayIndexOutOfBoundsException in ALS
zhengxiang pan created SPARK-19600: -- Summary: ArrayIndexOutOfBoundsException in ALS Key: SPARK-19600 URL: https://issues.apache.org/jira/browse/SPARK-19600 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.1 Reporter: zhengxiang pan Priority: Blocker I understand that issue SPARK-3080 is closed, but I don't yet understand what causes the issue: memory, parallelism, or a negative userID or product ID? I consistently run into this issue with different training sets; can you suggest any areas to look at? java.lang.ArrayIndexOutOfBoundsException: 221529807 at org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:944) at org.apache.spark.ml.recommendation.ALS$$anonfun$partitionRatings$1$$anonfun$apply$6.apply(ALS.scala:940) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:200) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
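For context on the negative-ID question raised in the report: one classic way hash-based block assignment can produce an out-of-bounds index is the sign-preserving {{%}} operator. This is a hypothetical illustration of that failure mode, not ALS's actual partitioning code:

```java
// Hypothetical illustration, not ALS's actual code path: in Java and Scala,
// % keeps the sign of the dividend, so a raw id % numBlocks is negative for
// negative IDs and throws ArrayIndexOutOfBoundsException if used directly as
// an array index. Flooring the result yields a valid index for any int id.
public class BlockIds {
    public static int naiveBlockId(int id, int numBlocks) {
        return id % numBlocks;            // can be negative
    }

    public static int safeBlockId(int id, int numBlocks) {
        int m = id % numBlocks;
        return m < 0 ? m + numBlocks : m; // always in [0, numBlocks)
    }
}
```

The stdlib equivalent of {{safeBlockId}} is {{Math.floorMod(id, numBlocks)}}.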
[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
[ https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-19529: --- Fix Version/s: 1.6.4 > TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() > --- > > Key: SPARK-19529 > URL: https://issues.apache.org/jira/browse/SPARK-19529 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > In Spark's Netty RPC layer, TransportClientFactory.createClient() calls > awaitUninterruptibly() on a Netty future while waiting for a connection to be > established. This creates problem when a Spark task is interrupted while > blocking in this call (which can happen in the event of a slow connection > which will eventually time out). This has bad impacts on task cancellation > when interruptOnCancel = true. > As an example of the impact of this problem, I experienced significant > numbers of uncancellable "zombie tasks" on a production cluster where several > tasks were blocked trying to connect to a dead shuffle server and then > continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack: > {code} > java.lang.Object.wait(Native Method) > java.lang.Object.wait(Object.java:460) > io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) > io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > => holding Monitor(java.lang.Object@1849476028) > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350) > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120) > > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45) > > org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169) > > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > [...] > {code} > I believe that we can easily fix this by using the > InterruptedException-throwing await() instead. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
[ https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-19529: --- Target Version/s: 1.6.4, 2.0.3, 2.1.1, 2.2.0 (was: 1.6.3, 2.0.3, 2.1.1, 2.2.0) > TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() > --- > > Key: SPARK-19529 > URL: https://issues.apache.org/jira/browse/SPARK-19529 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 1.6.4, 2.0.3, 2.1.1, 2.2.0 > > > In Spark's Netty RPC layer, TransportClientFactory.createClient() calls > awaitUninterruptibly() on a Netty future while waiting for a connection to be > established. This creates problem when a Spark task is interrupted while > blocking in this call (which can happen in the event of a slow connection > which will eventually time out). This has bad impacts on task cancellation > when interruptOnCancel = true. > As an example of the impact of this problem, I experienced significant > numbers of uncancellable "zombie tasks" on a production cluster where several > tasks were blocked trying to connect to a dead shuffle server and then > continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack: > {code} > java.lang.Object.wait(Native Method) > java.lang.Object.wait(Object.java:460) > io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) > io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > => holding Monitor(java.lang.Object@1849476028) > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350) > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120) > > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45) > > org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169) > > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > [...] > {code} > I believe that we can easily fix this by using the > InterruptedException-throwing await() instead. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
[ https://issues.apache.org/jira/browse/SPARK-19599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-19599: --- Affects Version/s: (was: 2.1.0) 2.2.0 > Clean up HDFSMetadataLog for Hadoop 2.6+ > > > Key: SPARK-19599 > URL: https://issues.apache.org/jira/browse/SPARK-19599 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu > > SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some > cleanup for HDFSMetadataLog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19529) TransportClientFactory.createClient() shouldn't call awaitUninterruptibly()
[ https://issues.apache.org/jira/browse/SPARK-19529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-19529: --- Fix Version/s: 2.2.0 2.1.1 2.0.3 > TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() > --- > > Key: SPARK-19529 > URL: https://issues.apache.org/jira/browse/SPARK-19529 > Project: Spark > Issue Type: Bug > Components: Shuffle, Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > In Spark's Netty RPC layer, TransportClientFactory.createClient() calls > awaitUninterruptibly() on a Netty future while waiting for a connection to be > established. This creates problem when a Spark task is interrupted while > blocking in this call (which can happen in the event of a slow connection > which will eventually time out). This has bad impacts on task cancellation > when interruptOnCancel = true. > As an example of the impact of this problem, I experienced significant > numbers of uncancellable "zombie tasks" on a production cluster where several > tasks were blocked trying to connect to a dead shuffle server and then > continued running as zombies after I cancelled the associated Spark stage. 
> The zombie tasks ran for several minutes with the following stack: > {code} > java.lang.Object.wait(Native Method) > java.lang.Object.wait(Object.java:460) > io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607) > io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224) > > org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) > => holding Monitor(java.lang.Object@1849476028) > org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140) > > org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120) > > org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350) > org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286) > > org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120) > > org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45) > > org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169) > > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > [...] > {code} > I believe that we can easily fix this by using the > InterruptedException-throwing await() instead. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19599) Clean up HDFSMetadataLog for Hadoop 2.6+
Shixiong Zhu created SPARK-19599: Summary: Clean up HDFSMetadataLog for Hadoop 2.6+ Key: SPARK-19599 URL: https://issues.apache.org/jira/browse/SPARK-19599 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19587) Disallow when sort columns are part of partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19587: Assignee: (was: Apache Spark) > Disallow when sort columns are part of partitioning columns > --- > > Key: SPARK-19587 > URL: https://issues.apache.org/jira/browse/SPARK-19587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil > > This came up in discussion at > https://github.com/apache/spark/pull/16898#discussion_r100697138 > Allowing partition columns to be a part of sort columns should not be > supported (logically it does not make sense). > {code} > df.write > .format(source) > .partitionBy("i") > .bucketBy(8, "x") > .sortBy("i") > .saveAsTable("bucketed_table") > {code} > Hive fails for such case. > {code} > CREATE TABLE user_info_bucketed(user_id BIGINT) > PARTITIONED BY(ds STRING) > CLUSTERED BY(user_id) > SORTED BY (ds ASC) > INTO 8 BUCKETS; > > FAILED: SemanticException [Error 10002]: Invalid column reference > Caused by: SemanticException: Invalid column reference > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
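The check this ticket asks for amounts to rejecting any overlap between the sortBy and partitionBy column sets. A sketch of that validation, with illustrative names rather than Spark's internal API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the proposed validation (names are illustrative, not Spark's
// internal API): reject a write when any sortBy column is also a partitionBy
// column. Within one partition directory the column holds a single value, so
// sorting by it is meaningless, which is why Hive rejects it too.
public class BucketSpecCheck {
    public static void checkSortColumns(List<String> partitionCols, List<String> sortCols) {
        List<String> overlap = sortCols.stream()
                .filter(partitionCols::contains)
                .collect(Collectors.toList());
        if (!overlap.isEmpty()) {
            throw new IllegalArgumentException(
                "sortBy columns " + overlap + " are part of partitionBy columns");
        }
    }
}
```

Applied to the example above, {{partitionBy("i")}} with {{sortBy("i")}} would fail fast at write time instead of producing a nonsensical table layout.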
[jira] [Commented] (SPARK-19587) Disallow when sort columns are part of partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866437#comment-15866437 ] Apache Spark commented on SPARK-19587: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/16931 > Disallow when sort columns are part of partitioning columns > --- > > Key: SPARK-19587 > URL: https://issues.apache.org/jira/browse/SPARK-19587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil > > This came up in discussion at > https://github.com/apache/spark/pull/16898#discussion_r100697138 > Allowing partition columns to be a part of sort columns should not be > supported (logically it does not make sense). > {code} > df.write > .format(source) > .partitionBy("i") > .bucketBy(8, "x") > .sortBy("i") > .saveAsTable("bucketed_table") > {code} > Hive fails for such case. > {code} > CREATE TABLE user_info_bucketed(user_id BIGINT) > PARTITIONED BY(ds STRING) > CLUSTERED BY(user_id) > SORTED BY (ds ASC) > INTO 8 BUCKETS; > > FAILED: SemanticException [Error 10002]: Invalid column reference > Caused by: SemanticException: Invalid column reference > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19587) Disallow when sort columns are part of partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-19587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19587: Assignee: Apache Spark > Disallow when sort columns are part of partitioning columns > --- > > Key: SPARK-19587 > URL: https://issues.apache.org/jira/browse/SPARK-19587 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Tejas Patil >Assignee: Apache Spark > > This came up in discussion at > https://github.com/apache/spark/pull/16898#discussion_r100697138 > Allowing partition columns to be a part of sort columns should not be > supported (logically it does not make sense). > {code} > df.write > .format(source) > .partitionBy("i") > .bucketBy(8, "x") > .sortBy("i") > .saveAsTable("bucketed_table") > {code} > Hive fails for such case. > {code} > CREATE TABLE user_info_bucketed(user_id BIGINT) > PARTITIONED BY(ds STRING) > CLUSTERED BY(user_id) > SORTED BY (ds ASC) > INTO 8 BUCKETS; > > FAILED: SemanticException [Error 10002]: Invalid column reference > Caused by: SemanticException: Invalid column reference > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
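The constraint discussed in SPARK-19587 can be sketched as a standalone check. This is a minimal illustration with a hypothetical helper name, not Spark's actual analyzer code: reject sortBy columns that are also partition columns, since every row within a partition shares the same value for those columns, so sorting buckets by them is meaningless.

```python
# Hypothetical validation helper (illustration only, not Spark's real check):
# sortBy columns must not overlap with partitionBy columns.

def check_bucket_sort_columns(partition_cols, sort_cols):
    overlap = sorted(set(partition_cols) & set(sort_cols))
    if overlap:
        raise ValueError(
            "sortBy columns %s are also partition columns; every row in a "
            "partition has the same value for them, so sorting buckets by "
            "them is meaningless" % overlap)

check_bucket_sort_columns(["i"], ["x"])      # fine: mirrors bucketBy(8, "x").sortBy("x")
try:
    check_bucket_sort_columns(["i"], ["i"])  # mirrors the failing example in the ticket
except ValueError as e:
    print("rejected:", e)
```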
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866428#comment-15866428 ] Armin Braun commented on SPARK-19592: - Imo this also relates to the ability to handle https://issues.apache.org/jira/browse/SPARK-8985 in a clean way btw. > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general, it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away given that > many subclasses of the parent suite do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suite setups from there step > by step until the sys properties can be removed from the pom and build.sbt. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. 
> (also it would allow running more tests straight from the IDE which is always > a nice thing imo) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
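The pattern proposed above can be sketched in pure Python with hypothetical names (stand-ins, not actual Spark code): keep the shared test settings in one place in code so that suites inherit them, and any deviation becomes an explicit, visible override in the test itself rather than something injected by Maven or SBT.

```python
# Sketch of a parent suite that owns the shared configuration.
# Names and settings here are illustrative assumptions, not Spark's real code.
import unittest

# Settings that today are injected via -D system properties
# in both pom.xml and the SBT build.
SHARED_TEST_CONF = {
    "spark.ui.enabled": "false",
    "spark.ui.showConsoleProgress": "false",
}

class SparkFunSuite(unittest.TestCase):
    """Parent suite exposing the shared test configuration."""
    def shared_conf(self):
        # Return a copy so one suite's overrides never leak into another.
        return dict(SHARED_TEST_CONF)

class InputStreamsSuite(SparkFunSuite):
    def test_ui_enabled_explicitly(self):
        conf = self.shared_conf()
        conf["spark.ui.enabled"] = "true"  # the deviation is visible right here
        self.assertEqual(conf["spark.ui.enabled"], "true")
```

With this shape, reading a test shows the full effective configuration without having to cross-reference the build files.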
[jira] [Commented] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
[ https://issues.apache.org/jira/browse/SPARK-19598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866410#comment-15866410 ] Reynold Xin commented on SPARK-19598: - cc [~windpiger] are you interested in working on this? > Remove the alias parameter in UnresolvedRelation > > > Key: SPARK-19598 > URL: https://issues.apache.org/jira/browse/SPARK-19598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin > > UnresolvedRelation has a second argument named "alias", for assigning the > relation an alias. I think we can actually remove it and replace its use with > a SubqueryAlias. > This would actually simplify some analyzer code to only match on > SubqueryAlias. For example, the broadcast hint pull request can have one > fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19598) Remove the alias parameter in UnresolvedRelation
Reynold Xin created SPARK-19598: --- Summary: Remove the alias parameter in UnresolvedRelation Key: SPARK-19598 URL: https://issues.apache.org/jira/browse/SPARK-19598 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.0 Reporter: Reynold Xin UnresolvedRelation has a second argument named "alias", for assigning the relation an alias. I think we can actually remove it and replace its use with a SubqueryAlias. This would actually simplify some analyzer code to only match on SubqueryAlias. For example, the broadcast hint pull request can have one fewer case https://github.com/apache/spark/pull/16925/files. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5
[ https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19571. -- Resolution: Fixed Assignee: Hyukjin Kwon > tests are failing to run on Windows with another instance Derby error with > Hadoop 2.6.5 > --- > > Key: SPARK-19571 > URL: https://issues.apache.org/jira/browse/SPARK-19571 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung >Assignee: Hyukjin Kwon > > Between > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master > https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db > And > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master > https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a > Something is changed (not likely caused by R) such that tests running on > Windows are consistently failing with > {code} > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database > C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore. 
> at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown > Source) > at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown > Source) > at java.security.AccessController.doPrivileged(Native Method) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) > at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown > Source) > at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown > Source) > {code} > Since we run appveyor only when there is R changes, it is a bit harder to > track down which change specifically caused this. > We also can't run appveyor on branch-2.1, so it could also be broken there. > This could be a blocker, since it could fail tests for the R release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19571) tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5
[ https://issues.apache.org/jira/browse/SPARK-19571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-19571: - Summary: tests are failing to run on Windows with another instance Derby error with Hadoop 2.6.5 (was: appveyor windows tests are failing) > tests are failing to run on Windows with another instance Derby error with > Hadoop 2.6.5 > --- > > Key: SPARK-19571 > URL: https://issues.apache.org/jira/browse/SPARK-19571 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.2.0 >Reporter: Felix Cheung > > Between > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/751-master > https://github.com/apache/spark/commit/7a7ce272fe9a703f58b0180a9d2001ecb5c4b8db > And > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/build/758-master > https://github.com/apache/spark/commit/c618ccdbe9ac103dfa3182346e2a14a1e7fca91a > Something is changed (not likely caused by R) such that tests running on > Windows are consistently failing with > {code} > Caused by: ERROR XSDB6: Another instance of Derby may have already booted the > database > C:\Users\appveyor\AppData\Local\Temp\1\spark-75266bb9-bd54-4ee2-ae54-2122d2c011e8\metastore. 
> at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at org.apache.derby.iapi.error.StandardException.newException(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown > Source) > at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown > Source) > at java.security.AccessController.doPrivileged(Native Method) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown > Source) > at > org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source) > at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown > Source) > at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown > Source) > {code} > Since we run appveyor only when there is R changes, it is a bit harder to > track down which change specifically caused this. > We also can't run appveyor on branch-2.1, so it could also be broken there. > This could be a blocker, since it could fail tests for the R release. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19163) Lazy creation of the _judf
[ https://issues.apache.org/jira/browse/SPARK-19163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz resolved SPARK-19163. Resolution: Fixed Fix Version/s: 2.2.0 > Lazy creation of the _judf > -- > > Key: SPARK-19163 > URL: https://issues.apache.org/jira/browse/SPARK-19163 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 >Reporter: Maciej Szymkiewicz > Fix For: 2.2.0 > > > Current state > Right now {{UserDefinedFunction}} eagerly creates {{_judf}} and initializes > {{SparkSession}} > (https://github.com/apache/spark/blob/master/python/pyspark/sql/functions.py#L1832) > as a side effect. This behavior may have undesired results when {{udf}} is > imported from a module: > {{myudfs.py}} > {code} > from pyspark.sql.functions import udf > from pyspark.sql.types import IntegerType > > def _add_one(x): > """Adds one""" > if x is not None: > return x + 1 > > add_one = udf(_add_one, IntegerType()) > {code} > > > Example session: > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:55:44 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:55:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > In [4]: spark.sparkContext.appName > Out[4]: 'pyspark-shell' > {code} > Proposed > Delay {{_judf}} initialization until the first call. > {code} > In [1]: from pyspark.sql import SparkSession > In [2]: from myudfs import add_one > In [3]: spark = SparkSession.builder.appName("foo").getOrCreate() > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". 
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/01/07 19:58:38 WARN Utils: Your hostname, xxx resolves to a loopback > address: 127.0.1.1; using xxx instead (on interface eth0) > 17/01/07 19:58:38 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > In [4]: spark.sparkContext.appName > Out[4]: 'foo' > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
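The lazy-initialization pattern the ticket proposes can be sketched in plain Python. This is a stand-in class, not the real `pyspark.sql.functions.UserDefinedFunction`: creation of the expensive JVM-side object is deferred until the first call, so merely importing a module that defines udfs no longer initializes a session as a side effect.

```python
# Illustration of lazy _judf creation (hypothetical stand-in, not pyspark code).

class UserDefinedFunction:
    def __init__(self, func):
        self.func = func
        self._judf_placeholder = None  # nothing expensive happens at definition time

    @property
    def _judf(self):
        # Built lazily on first use; in real pyspark this is where the
        # SparkSession would be obtained.
        if self._judf_placeholder is None:
            self._judf_placeholder = ("initialized", self.func)
        return self._judf_placeholder

    def __call__(self, x):
        return self._judf[1](x)

add_one = UserDefinedFunction(lambda x: x + 1 if x is not None else None)

assert add_one._judf_placeholder is None   # defining the udf initialized nothing
assert add_one(1) == 2                     # first call triggers initialization
assert add_one._judf_placeholder is not None
```

This mirrors the fixed behavior shown in the "Proposed" session above, where the app name set by the user ('foo') wins because no session existed at import time.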
[jira] [Commented] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866393#comment-15866393 ] Shea Parkes commented on SPARK-18541: - Thank you very much! > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Assignee: Shea Parkes >Priority: Minor > Labels: newbie > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest for it. > Just have to dust off my knowledge of doctest and the location of the python > tests. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866367#comment-15866367 ] Armin Braun commented on SPARK-19592: - [~srowen] {quote} What about tests that make their own conf or need to? {quote} Those tests in particular made me interested in this for correctness/readability reasons. Maybe an example helps :) in _org.apache.spark.streaming.InputStreamsSuite_ we have the situation that the conf is set up in the parent suite via just {code} val conf = new SparkConf() .setMaster(master) .setAppName(framework) {code} now if you run that suite from the IDE one of the tests fails with an apparent error in the logic. {code} The code passed to eventually never returned normally. Attempted 664 times over 10.01260721901 seconds. Last failure message: 10 did not equal 5. {code} You debug it and find out that it's because you get some _StreamingListener_ added to the context twice because the test adds one manually that is already on the context. The reason for that is that it's also added by the UI when you have _spark.ui.enabled_ at its default of _true_. So basically you now have a seemingly redundant line of code in a bunch of tests: {code} ssc.addStreamingListener(ssc.progressListener) {code} ... that appears wrong given the configuration (that you see if you just read the code) and requires you to also consider (maintain) what Maven or SBT is injecting in terms of environment. --- So I think those tests that make their own config are the most troublesome since they have non-standard defaults injected. In my opinion it would be a lot easier to work with if the defaults were just the standard production env. defaults when I create a new instance of the SparkConf, and all deviation from that were explicit in the code. I agree it's not a big pain, still a quality issue worth fixing (imo). Reduces maintenance effort from DRYer test configs and makes tests easier to read like in the example above. 
> Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away given that > many subclasses of the parent suit do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suit setups from there step > by step until the sys properties can be removed from the pom and sbt.build. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. > (also it would allow running more tests straight from the IDE which is always > a nice thing imo) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19552) Upgrade Netty version to 4.1.8 final
[ https://issues.apache.org/jira/browse/SPARK-19552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-19552. -- Resolution: Later > Upgrade Netty version to 4.1.8 final > > > Key: SPARK-19552 > URL: https://issues.apache.org/jira/browse/SPARK-19552 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.1.0 >Reporter: Adam Roberts >Priority: Minor > > Netty 4.1.8 was recently released but isn't API compatible with previous > major versions (like Netty 4.0.x), see > http://netty.io/news/2017/01/30/4-0-44-Final-4-1-8-Final.html for details. > This version does include a fix for a security concern but not one we'd be > exposed to with Spark "out of the box". Let's upgrade the version we use to > be on the safe side as the security fix I'm especially interested in is not > available in the 4.0.x release line. > We should move up anyway to take on a bunch of other big fixes cited in the > release notes (and if anyone were to use Spark with netty and tcnative, they > shouldn't be exposed to the security problem) - we should be good citizens > and make this change. > As this 4.1 version involves API changes we'll need to implement a few > methods and possibly adjust the Sasl tests. This JIRA and associated pull > request starts the process which I'll work on - and any help would be much > appreciated! 
Currently I know: > {code} > @Override > public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise > promise) > throws Exception { > if (!foundEncryptionHandler) { > foundEncryptionHandler = > ctx.channel().pipeline().get(encryptHandlerName) != null; <-- this > returns false and causes test failures > } > ctx.write(msg, promise); > } > {code} > Here's what changes will be required (at least): > {code} > common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java{code} > requires touch, retain and transferred methods > {code} > common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java{code} > requires the above methods too > {code}common/network-common/src/test/java/org/apache/spark/network/protocol/MessageWithHeaderSuite.java{code} > With "dummy" implementations so we can at least compile and test, we'll see > five new test failures to address. > These are > {code} > org.apache.spark.network.sasl.SparkSaslSuite.testFileRegionEncryption > org.apache.spark.network.sasl.SparkSaslSuite.testSaslEncryption > org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.send with SASL encryption > org.apache.spark.rpc.netty.NettyRpcEnvSuite.ask with SASL encryption > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19592) Duplication in Test Configuration Relating to SparkConf Settings Should be Removed
[ https://issues.apache.org/jira/browse/SPARK-19592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866311#comment-15866311 ] Sean Owen commented on SPARK-19592: --- What about tests that make their own conf or need to? I don't think this is a significant pain to bother with. > Duplication in Test Configuration Relating to SparkConf Settings Should be > Removed > -- > > Key: SPARK-19592 > URL: https://issues.apache.org/jira/browse/SPARK-19592 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 2.1.0, 2.2.0 > Environment: Applies to all Environments >Reporter: Armin Braun >Priority: Minor > > This configuration for Surefire, Scalatest is duplicated in the parent POM as > well as the SBT build. > While this duplication cannot be removed in general it can at least be > removed for all system properties that simply result in a SparkConf setting I > think. > Instead of having lines like > {code} > false > {code} > twice in the pom.xml > and once in SBT as > {code} > javaOptions in Test += "-Dspark.ui.enabled=false", > {code} > it would be a lot cleaner to simply have a > {code} > var conf: SparkConf > {code} > field in > {code} > org.apache.spark.SparkFunSuite > {code} > that has SparkConf set up with all the shared configuration that > `systemProperties` currently provide. Obviously this cannot be done straight > away given that > many subclasses of the parent suit do this, so I think it would be best to > simply add a method to the parent that provides this configuration for now > and start refactoring away duplication in other suit setups from there step > by step until the sys properties can be removed from the pom and sbt.build. > This makes the build a lot easier to maintain and makes tests more readable > by making the environment setup more explicit in the code. 
> (also it would allow running more tests straight from the IDE which is always > a nice thing imo) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19582) DataFrameReader conceptually inadequate
[ https://issues.apache.org/jira/browse/SPARK-19582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19582. --- Resolution: Invalid I don't understand what this is describing. Is it a dependency conflict? if so, what? You say DataFrameReader understands every possible source, but it doesn't of course. It is also not designed to exclude a particular data source of course. You can already supply a bunch of strings to make a Dataset of strings. Is this about Minio? what does 'forward' mean? Rather than reply, please continue with more clarity on the mailing list. This isn't clear or specific enough for a JIRA > DataFrameReader conceptually inadequate > --- > > Key: SPARK-19582 > URL: https://issues.apache.org/jira/browse/SPARK-19582 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.0 >Reporter: James Q. Arnold > > DataFrameReader assumes it "understands" all data sources (local file system, > object stores, jdbc, ...). This seems limiting in the long term, imposing > both development costs to accept new sources and dependency issues for > existing sources (how to coordinate the XX jar for internal use vs. the XX > jar used by the application). Unless I have missed how this can be done > currently, an application with an unsupported data source cannot create the > required RDD for distribution. > I recommend at least providing a text API for supplying data. Let the > application provide data as a String (or char[] or ...)---not a path, but the > actual data. Alternatively, provide interfaces or abstract classes the > application could provide to let the application handle external data > sources, without forcing all that complication into the Spark implementation. > I don't have any code to submit, but JIRA seemed like to most appropriate > place to raise the issue. > Finally, if I have overlooked how this can be done with the current API, a > new example would be appreciated. 
> Additional detail... > We use the minio object store, which provides an API compatible with AWS-S3. > A few configuration/parameter values differ for minio, but one can use the > AWS library in the application to connect to the minio server. > When trying to use minio objects through spark, the s3://xxx paths are > intercepted by spark and handed to hadoop. So far, I have been unable to > find the right combination of configuration values and parameters to > "convince" hadoop to forward the right information to work with minio. If I > could read the minio object in the application, and then hand the object > contents directly to spark, I could bypass hadoop and solve the problem. > Unfortunately, the underlying spark design prevents that. So, I see two > problems. > - Spark seems to have taken on the responsibility of "knowing" the API > details of all data sources. This seems iffy in the long run (and is the > root of my current problem). In the long run, it seems unwise to assume that > spark should understand all possible path names, protocols, etc. Moreover, > passing S3 paths to hadoop seems a little odd (why not go directly to AWS, > for example). This particular confusion about S3 shows the difficulties that > are bound to occur. > - Second, spark appears not to have a way to bypass the path name > interpretation. At the least, spark could provide a text/blob interface, > letting the application supply the data object and avoid path interpretation > inside spark. Alternatively, spark could accept a reader/stream/... to build > the object, again letting the application provide the implementation of the > object input. > As I mentioned above, I might be missing something in the API that lets us > work around the problem. I'll keep looking, but the API as apparently > structured seems too limiting. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib
[ https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866295#comment-15866295 ] Timothy Hunter commented on SPARK-14523: Also, the correlation is missing the multivariate case. I will take this task over unless someone else expresses interest. > Feature parity for Statistics ML with MLlib > --- > > Key: SPARK-14523 > URL: https://issues.apache.org/jira/browse/SPARK-14523 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: yuhao yang > > Some statistics functions have been supported by DataFrame directly. Use this > jira to discuss/design the statistics package in Spark.ML and its function > scope. Hypothesis test and correlation computation may still need to expose > independent interfaces. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)
[ https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866288#comment-15866288 ] Timothy Hunter commented on SPARK-4591: --- [~josephkb] do you also want some subtasks for KernelDensity and multivariate summaries? They are in the state module but not covered. > Algorithm/model parity for spark.ml (Scala) > --- > > Key: SPARK-4591 > URL: https://issues.apache.org/jira/browse/SPARK-4591 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Xiangrui Meng >Priority: Critical > > This is an umbrella JIRA for porting spark.mllib implementations to use the > DataFrame-based API defined under spark.ml. We want to achieve critical > feature parity for the next release. > h3. Instructions for 3 subtask types > *Review tasks*: detailed review of a subpackage to identify feature gaps > between spark.mllib and spark.ml. > * Should be listed as a subtask of this umbrella. > * Review subtasks cover major algorithm groups. To pick up a review subtask, > please: > ** Comment that you are working on it. > ** Compare the public APIs of spark.ml vs. spark.mllib. > ** Comment on all missing items within spark.ml: algorithms, models, methods, > features, etc. > ** Check for existing JIRAs covering those items. If there is no existing > JIRA, create one, and link it to your comment. > *Critical tasks*: higher priority missing features which are required for > this umbrella JIRA. > * Should be linked as "requires" links. > *Other tasks*: lower priority missing features which can be completed after > the critical tasks. > * Should be linked as "contains" links. > h4. Excluded items > This does *not* include: > * Python: We can compare Scala vs. Python in spark.ml itself. > * Moving linalg to spark.ml: [SPARK-13944] > * Streaming ML: Requires stabilizing some internal APIs of structured > streaming first > h3. 
TODO list > *Critical issues* > * [SPARK-14501]: Frequent Pattern Mining > * [SPARK-14709]: linear SVM > * [SPARK-15784]: Power Iteration Clustering (PIC) > *Lower priority issues* > * Missing methods within algorithms (see Issue Links below) > * evaluation submodule > * stat submodule (should probably be covered in DataFrames) > * Developer-facing submodules: > ** optimization (including [SPARK-17136]) > ** random, rdd > ** util > *To be prioritized* > * single-instance prediction: [SPARK-10413] > * pmml [SPARK-11171]
[jira] [Assigned] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-18541: --- Assignee: Shea Parkes > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Assignee: Shea Parkes >Priority: Minor > Labels: newbie > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest in it. > Just have to dust off my knowledge of doctest and the location of the python > tests.
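The proposal above hinges on serializing the Python metadata dict to JSON so the JVM-side {{Metadata.fromJson}} can parse it. A minimal sketch of that round trip without a running JVM (here {{json.loads}} stands in for the JVM parse; {{metadata_to_json}} is an illustrative helper, not pyspark API):

```python
# Sketch of the metadata round trip behind the proposed aliasWithMetadata.
# json.loads stands in for org.apache.spark.sql.types.Metadata.fromJson,
# which on the JVM side parses the same JSON string.
import json

def metadata_to_json(metadata):
    # The proposal serializes the Python dict with json.dumps before
    # handing it across the py4j bridge.
    return json.dumps(metadata)

meta = {"comment": "customer surrogate key", "maxlength": 32}
payload = metadata_to_json(meta)
parsed = json.loads(payload)  # JVM side would call Metadata.fromJson(payload)
```

Since JSON round-trips plain dicts of strings and numbers losslessly, the metadata survives the bridge unchanged.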
[jira] [Resolved] (SPARK-18541) Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management in pyspark SQL API
[ https://issues.apache.org/jira/browse/SPARK-18541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-18541. - Resolution: Fixed Fix Version/s: 2.2.0 > Add pyspark.sql.Column.aliasWithMetadata to allow dynamic metadata management > in pyspark SQL API > > > Key: SPARK-18541 > URL: https://issues.apache.org/jira/browse/SPARK-18541 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.0.2 > Environment: all >Reporter: Shea Parkes >Assignee: Shea Parkes >Priority: Minor > Labels: newbie > Fix For: 2.2.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > In the Scala SQL API, you can pass in new metadata when you alias a field. > That functionality is not available in the Python API. Right now, you have > to painfully utilize {{SparkSession.createDataFrame}} to manipulate the > metadata for even a single column. I would propose to add the following > method to {{pyspark.sql.Column}}: > {code} > def aliasWithMetadata(self, name, metadata): > """ > Make a new Column that has the provided alias and metadata. > Metadata will be processed with json.dumps() > """ > _context = pyspark.SparkContext._active_spark_context > _metadata_str = json.dumps(metadata) > _metadata_jvm = > _context._jvm.org.apache.spark.sql.types.Metadata.fromJson(_metadata_str) > _new_java_column = getattr(self._jc, 'as')(name, _metadata_jvm) > return Column(_new_java_column) > {code} > I can likely complete this request myself if there is any interest in it. > Just have to dust off my knowledge of doctest and the location of the python > tests.
[jira] [Commented] (SPARK-13219) Pushdown predicate propagation in SparkSQL with join
[ https://issues.apache.org/jira/browse/SPARK-13219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866262#comment-15866262 ] Nick Dimiduk commented on SPARK-13219: -- I would implement this manually by materializing the smaller relation in the driver and then transforming those values into a filter applied to the larger. Frankly, I expected this to be the meaning of a broadcast join. I'm wondering if I'm doing something to prevent the planner from performing this optimization, so maybe the mailing list is a more appropriate place to discuss? > Pushdown predicate propagation in SparkSQL with join > > > Key: SPARK-13219 > URL: https://issues.apache.org/jira/browse/SPARK-13219 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.4.1, 1.5.2, 1.6.0 > Environment: Spark 1.4 > Datastax Spark connector 1.4 > Cassandra. 2.1.12 > Centos 6.6 >Reporter: Abhinav Chawade > > When 2 or more tables are joined in SparkSQL and there is an equality clause > in the query on the attributes used to perform the join, it is useful to apply that > clause to the scans of both tables. If this is not done, one of the tables > requires a full scan, which can slow the query dramatically. Consider the > following example with 2 tables being joined. > {code} > CREATE TABLE assets ( > assetid int PRIMARY KEY, > address text, > propertyname text > ) > CREATE TABLE tenants ( > assetid int PRIMARY KEY, > name text > ) > spark-sql> explain select t.name from tenants t, assets a where a.assetid = > t.assetid and t.assetid='1201'; > WARN 2016-02-05 23:05:19 org.apache.hadoop.util.NativeCodeLoader: Unable to > load native-hadoop library for your platform... 
using builtin-java classes > where applicable > == Physical Plan == > Project [name#14] > ShuffledHashJoin [assetid#13], [assetid#15], BuildRight > Exchange (HashPartitioning 200) >Filter (CAST(assetid#13, DoubleType) = 1201.0) > HiveTableScan [assetid#13,name#14], (MetastoreRelation element, tenants, > Some(t)), None > Exchange (HashPartitioning 200) >HiveTableScan [assetid#15], (MetastoreRelation element, assets, Some(a)), > None > Time taken: 1.354 seconds, Fetched 8 row(s) > {code} > The simple workaround is to add another equality condition for each table, but > it becomes cumbersome. It would be helpful if the query planner could improve > filter propagation.
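The workaround described in SPARK-13219 above, restating the constant equality predicate for each table so each scan carries its own filter, can be sketched as a query rewrite. SQLite stands in here purely to show the rewrite and that both forms return the same rows; in Spark, the duplicated predicate is what can be pushed down to each table's scan:

```python
# Sketch of the manual workaround: duplicate the equality predicate so the
# assets scan gets a filter too, instead of a full scan. SQLite is only used
# to demonstrate the rewrite is equivalent; the performance win is Spark-side.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assets (assetid INTEGER PRIMARY KEY, address TEXT, propertyname TEXT);
    CREATE TABLE tenants (assetid INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO assets VALUES (1201, '1 Main St', 'Main Plaza'), (1202, '2 Oak Ave', 'Oak Court');
    INSERT INTO tenants VALUES (1201, 'Acme'), (1202, 'Globex');
""")

original = """
    SELECT t.name FROM tenants t, assets a
    WHERE a.assetid = t.assetid AND t.assetid = 1201
"""
rewritten = """
    SELECT t.name FROM tenants t, assets a
    WHERE a.assetid = t.assetid
      AND t.assetid = 1201
      AND a.assetid = 1201  -- redundant copy, lets the assets scan filter too
"""
rows_original = conn.execute(original).fetchall()
rows_rewritten = conn.execute(rewritten).fetchall()
```

Both queries return the single matching tenant; only the predicate placement differs, which is exactly what the proposed planner improvement would automate.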
[jira] [Commented] (SPARK-12957) Derive and propagate data constrains in logical plan
[ https://issues.apache.org/jira/browse/SPARK-12957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866239#comment-15866239 ] Nick Dimiduk commented on SPARK-12957: -- [~sameerag] thanks for the comment. From a naive scan of the tickets, I believe I am seeing the benefits of SPARK-13871 in that an {{IsNotNull}} constraint is applied from the names of the join columns. However, I don't see the benefit of SPARK-13789, specifically the {{a = 5, a = b}} case mentioned in the description. My query is a join between a very small relation (100's of rows) and a very large one (10's of billions). I've hinted the planner to broadcast the smaller table, which it honors. After SPARK-13789, I expected to see the join column values pushed down as well. This is not the case. Any tips on debugging this further? I've set breakpoints in the {{RelationProvider}} implementation and see that it's only receiving the {{IsNotNull}} filters, nothing further from the planner. Thanks a lot! > Derive and propagate data constrains in logical plan > - > > Key: SPARK-12957 > URL: https://issues.apache.org/jira/browse/SPARK-12957 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Yin Huai >Assignee: Sameer Agarwal > Attachments: ConstraintPropagationinSparkSQL.pdf > > > Based on the semantics of a query plan, we can derive data constraints (e.g. if > a filter defines {{a > 10}}, we know that the output data of this filter > satisfies the constraints {{a > 10}} and {{a is not null}}). We should build a > framework to derive and propagate constraints in the logical plan, which can > help us build more advanced optimizations.
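The framework described in SPARK-12957 can be pictured as a small fixed-point pass: start from explicit predicates and derive implied ones (non-nullness from any comparison, constant propagation across equalities as in the {{a = 5, a = b}} example from the comment). A toy Python sketch of that idea, not Catalyst code, with constraints encoded as {{(op, lhs, rhs)}} triples:

```python
# Toy constraint-propagation pass in the spirit of the framework described
# above (NOT the Catalyst implementation). Constraints are (op, lhs, rhs)
# triples; attribute names are strings, literals are ints.
def propagate(constraints):
    derived = set(constraints)
    # Any attribute appearing in a comparison must be non-null.
    for op, lhs, rhs in constraints:
        derived.add(("isnotnull", lhs, None))
        if isinstance(rhs, str):
            derived.add(("isnotnull", rhs, None))
    # Constant propagation across equalities: a = 5 and a = b implies b = 5.
    changed = True
    while changed:
        changed = False
        consts = {lhs: rhs for op, lhs, rhs in derived
                  if op == "=" and isinstance(rhs, int)}
        for op, lhs, rhs in list(derived):
            if op == "=" and isinstance(rhs, str) and lhs in consts:
                new = ("=", rhs, consts[lhs])
                if new not in derived:
                    derived.add(new)
                    changed = True
    return derived

cs = propagate({("=", "a", 5), ("=", "a", "b")})
```

From {{a = 5}} and {{a = b}} the pass derives {{b = 5}} plus non-nullness of both attributes, which is exactly the extra filter the commenter expected to see pushed down to the scan.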
[jira] [Resolved] (SPARK-19162) UserDefinedFunction constructor should verify that func is callable
[ https://issues.apache.org/jira/browse/SPARK-19162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19162. - Resolution: Fixed Fix Version/s: 2.2.0 > UserDefinedFunction constructor should verify that func is callable > --- > > Key: SPARK-19162 > URL: https://issues.apache.org/jira/browse/SPARK-19162 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Minor > Fix For: 2.2.0 > > > Current state > Right now `UserDefinedFunctions` don't perform any input type validation. They > will accept non-callable objects just to fail with a hard-to-understand > traceback: > {code} > In [1]: from pyspark.sql.functions import udf > In [2]: df = spark.range(0, 1) > In [3]: f = udf(None) > In [4]: df.select(f()).first() > 17/01/07 19:30:50 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7) > ... > Py4JJavaError: An error occurred while calling o51.collectToPython. > ... > TypeError: 'NoneType' object is not callable > ... > {code} > Proposed > Apply basic validation for the {{func}} argument: > {code} > In [7]: udf(None) > --- > TypeError Traceback (most recent call last) > in () > > 1 udf(None) > ... > TypeError: func should be a callable object (a function or an instance of a > class with __call__). Got > {code}
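The fix in SPARK-19162 boils down to checking {{callable(func)}} at construction time rather than letting the executor fail later. A minimal sketch of that validation pattern ({{udf_sketch}} is a hypothetical stand-in, not pyspark's actual implementation):

```python
# Minimal sketch of the early validation SPARK-19162 adds: reject a
# non-callable func when the UDF is constructed, with a readable error,
# instead of failing deep inside executor-side deserialization.
# udf_sketch is a hypothetical name, not the real pyspark udf().
def udf_sketch(func):
    if not callable(func):
        raise TypeError(
            "func should be a callable object (a function or an instance "
            "of a class with __call__). Got {}".format(type(func)))

    def wrapper(*args):
        return func(*args)

    return wrapper

doubled = udf_sketch(lambda x: x * 2)
```

A valid function is wrapped as before, while `udf_sketch(None)` now raises `TypeError` immediately at the call site.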