[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging
[ https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330550#comment-15330550 ]

Russell Alexander Spitzer commented on SPARK-13928:
---------------------------------------------------

So users (like me ;) ) need to write their own Logging trait now? I'm a little confused based on the description.

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> --------------------------------------------------------------------
>
>                 Key: SPARK-13928
>                 URL: https://issues.apache.org/jira/browse/SPARK-13928
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Reynold Xin
>            Assignee: Wenchen Fan
>             Fix For: 2.0.0
>
> Logging was made private in Spark 2.0. If we move it, then users would be
> able to create a Logging trait themselves to avoid changing their own code.
> Alternatively, we can also provide a compatibility package that adds
> logging.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
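For anyone in the same spot, a user-side Logging trait is small to write. What follows is only a hedged sketch of the workaround the description suggests: to stay dependency-free it is backed by java.util.logging rather than the SLF4J facade Spark itself uses, and the method names merely imitate the removed org.apache.spark.Logging trait; none of this is an official compatibility API.

```scala
// Sketch of a user-defined Logging trait. Backend (java.util.logging) and
// method names are illustrative choices, not Spark's actual trait.
import java.util.logging.{Level, Logger}

trait Logging {
  // Lazily created per concrete class; stripSuffix("$") cleans up
  // Scala object names like "Worker$".
  @transient private lazy val log: Logger =
    Logger.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit =
    if (log.isLoggable(Level.INFO)) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isLoggable(Level.WARNING)) log.warning(msg)

  protected def logError(msg: => String, e: Throwable = null): Unit =
    log.log(Level.SEVERE, msg, e)
}
```

Mixing the trait into existing classes then works as before, with no references to Spark internals.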
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15326793#comment-15326793 ]

Russell Alexander Spitzer commented on SPARK-11415:
---------------------------------------------------

I honestly am super confused about what is supposed to happen vs what isn't. I ended up just tuning the connector to be aware of this so at least dates don't get corrupted when coming out of C*.
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/types/TypeConverter.scala#L314-L318

I blame java.sql.Date for being ... strange.

> Catalyst DateType Shifts Input Data by Local Timezone
> -----------------------------------------------------
>
>                 Key: SPARK-11415
>                 URL: https://issues.apache.org/jira/browse/SPARK-11415
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't
> get a consistent result for java.sql.Date. I investigated and noticed the
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
> /**
>  * Returns the number of days since epoch from java.sql.Date.
>  */
> def fromJavaDate(date: Date): SQLDate = {
>   millisToDays(date.getTime)
> }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying
> timestamp to the local timezone before calculating the days from epoch. This
> causes the invocation to move the actual date around.
> {code}
> // we should use the exact day as Int, for example, (year, month, day) -> day
> def millisToDays(millisUtc: Long): SQLDate = {
>   // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>   // will correctly work as input for function toJavaDate(Int)
>   val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
> }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
> // reverse of millisToDays
> def daysToMillis(days: SQLDate): Long = {
>   val millisUtc = days.toLong * MILLIS_PER_DAY
>   millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
> }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if
> the underlying data is worked on in different timezones than UTC.
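To make the shift concrete, here is a dependency-free sketch that re-implements the millisToDays logic quoted above, with the timezone made an explicit parameter (the real method reads a thread-local zone); the example instant and zones are my own, not from the ticket.

```scala
// Re-implementation of the quoted millisToDays for demonstration only;
// Spark's version reads the zone from a thread-local instead of a parameter.
import java.util.TimeZone

val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

def millisToDays(millisUtc: Long, tz: TimeZone): Int = {
  // The bug under discussion: the UTC instant is shifted by the local
  // offset before being divided into days.
  val millisLocal = millisUtc + tz.getOffset(millisUtc)
  Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}

// Midnight 1970-01-02 UTC is exactly one day after the epoch, yet the
// "days since epoch" result depends on which zone does the conversion:
val millis = 1L * MILLIS_PER_DAY
val inUtc = millisToDays(millis, TimeZone.getTimeZone("UTC"))                  // 1
val inLa  = millisToDays(millis, TimeZone.getTimeZone("America/Los_Angeles")) // 0
// The same instant mapping to different day numbers is the 1-off date
// shift described in the report.
```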
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15168168#comment-15168168 ]

Russell Alexander Spitzer commented on SPARK-11293:
---------------------------------------------------

Saw a similar issue loading the Friendster graph into TinkerPop and running the BulkVertexLoader with the Shuffled Vertex RDD persisted to disk.

> Spillable collections leak shuffle memory
> -----------------------------------------
>
>                 Key: SPARK-11293
>                 URL: https://issues.apache.org/jira/browse/SPARK-11293
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory
> manager consolidation patch, which added the ability to do strict memory leak
> detection for the bookkeeping that used to be performed by the
> ShuffleMemoryManager. This uncovered a handful of places where tasks can
> acquire execution/shuffle memory but never release it, starving themselves of
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing its resources.
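The {{CompletionIterator}} fix suggested above can be sketched without Spark: wrap the sorter's output iterator so a cleanup callback (e.g. {{stop()}}) fires exactly once when the iterator is exhausted. This is a simplified stand-in for Spark's own CompletionIterator, not its actual source.

```scala
// Minimal stand-in for Spark's CompletionIterator: invoke `completion`
// exactly once, as soon as the underlying iterator reports it is empty.
class CompletionIterator[A](sub: Iterator[A], completion: () => Unit)
    extends Iterator[A] {
  private var completed = false

  def hasNext: Boolean = {
    val r = sub.hasNext
    if (!r && !completed) { completed = true; completion() }
    r
  }

  def next(): A = sub.next()
}

// Hypothetical usage for the bullet above -- release the sorter's memory
// the moment the shuffle read finishes:
//   new CompletionIterator(sorterOutput, () => sorter.stop())
```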
[jira] [Commented] (SPARK-13289) Word2Vec generates infinite distances when numIterations > 5
[ https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158213#comment-15158213 ]

Russell Alexander Spitzer commented on SPARK-13289:
---------------------------------------------------

Same here.

> Word2Vec generates infinite distances when numIterations > 5
> ------------------------------------------------------------
>
>                 Key: SPARK-13289
>                 URL: https://issues.apache.org/jira/browse/SPARK-13289
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.6.0
>         Environment: Linux, Scala
>            Reporter: Qi Dai
>              Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on
> some large text dataset, but found that when the number of iterations is
> larger than 5 the distance between words will all be infinite. My code looks
> like this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are:
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.
[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118222#comment-15118222 ]

Russell Alexander Spitzer commented on SPARK-12639:
---------------------------------------------------

I keep forgetting about your bot!

> Improve Explain for DataSources with Handled Predicate Pushdowns
> ----------------------------------------------------------------
>
>                 Key: SPARK-12639
>                 URL: https://issues.apache.org/jira/browse/SPARK-12639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Russell Alexander Spitzer
>            Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended
> consequence of making the explain string more confusing. It basically makes
> it seem as if a source is always pushing down all of the filters (even those
> it cannot handle). This can have a confusing effect (I kept checking my code
> to see where I had broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the telltale "Filter" step is present, my first instinct would tell
> me that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
> HandledFilters: []
> {code}
[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15118221#comment-15118221 ]

Russell Alexander Spitzer commented on SPARK-12639:
---------------------------------------------------

https://github.com/apache/spark/pull/10929/files

[~yhuai] Made another PR for the * solution. I'm not sure where this should be documented, but if this is what you are looking for I'll find a place to note it :)

> Improve Explain for DataSources with Handled Predicate Pushdowns
> ----------------------------------------------------------------
>
>                 Key: SPARK-12639
>                 URL: https://issues.apache.org/jira/browse/SPARK-12639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Russell Alexander Spitzer
>            Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended
> consequence of making the explain string more confusing. It basically makes
> it seem as if a source is always pushing down all of the filters (even those
> it cannot handle). This can have a confusing effect (I kept checking my code
> to see where I had broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the telltale "Filter" step is present, my first instinct would tell
> me that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
> HandledFilters: []
> {code}
[jira] [Commented] (SPARK-12143) When column type is binary, select causes ClassCastException in Beeline.
[ https://issues.apache.org/jira/browse/SPARK-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113337#comment-15113337 ]

Russell Alexander Spitzer commented on SPARK-12143:
---------------------------------------------------

I see this as well, using beeline to read from the spark-sql-thriftserver on a table with a binary column.

{code}
0: jdbc:hive2://localhost:1> describe bios;
+------------+------------+--------------------+
|  col_name  | data_type  |      comment       |
+------------+------------+--------------------+
| user_name  | string     | from deserializer  |
| bio        | binary     | from deserializer  |
+------------+------------+--------------------+
2 rows selected (0.03 seconds)
0: jdbc:hive2://localhost:1> select * from bios;
Error: java.lang.String cannot be cast to [B (state=,code=0)
{code}

> When column type is binary, select causes ClassCastException in Beeline.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-12143
>                 URL: https://issues.apache.org/jira/browse/SPARK-12143
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: meiyoula
>
> In Beeline, execute the SQL below:
> 1. create table bb(bi binary);
> 2. load data inpath 'tmp/data' into table bb;
> 3. select * from bb;
> Error: java.lang.ClassCastException: java.lang.String cannot be cast to [B
> (state=, code=0)
[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088558#comment-15088558 ]

Russell Alexander Spitzer commented on SPARK-12639:
---------------------------------------------------

https://github.com/apache/spark/pull/10655 [~yhuai]

> Improve Explain for DataSources with Handled Predicate Pushdowns
> ----------------------------------------------------------------
>
>                 Key: SPARK-12639
>                 URL: https://issues.apache.org/jira/browse/SPARK-12639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Russell Alexander Spitzer
>            Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended
> consequence of making the explain string more confusing. It basically makes
> it seem as if a source is always pushing down all of the filters (even those
> it cannot handle). This can have a confusing effect (I kept checking my code
> to see where I had broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the telltale "Filter" step is present, my first instinct would tell
> me that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
> HandledFilters: []
> {code}
[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088537#comment-15088537 ]

Russell Alexander Spitzer commented on SPARK-12639:
---------------------------------------------------

This is a regression of SPARK-11390

> Improve Explain for DataSources with Handled Predicate Pushdowns
> ----------------------------------------------------------------
>
>                 Key: SPARK-12639
>                 URL: https://issues.apache.org/jira/browse/SPARK-12639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Russell Alexander Spitzer
>            Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended
> consequence of making the explain string more confusing. It basically makes
> it seem as if a source is always pushing down all of the filters (even those
> it cannot handle). This can have a confusing effect (I kept checking my code
> to see where I had broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the telltale "Filter" step is present, my first instinct would tell
> me that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
> HandledFilters: []
> {code}
[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters
[ https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082357#comment-15082357 ]

Russell Alexander Spitzer commented on SPARK-11661:
---------------------------------------------------

Yeah sure, np: https://issues.apache.org/jira/browse/SPARK-12639. I'll work on the PR later this week.

> We should still pushdown filters returned by a data source's unhandledFilters
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-11661
>                 URL: https://issues.apache.org/jira/browse/SPARK-11661
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>            Priority: Blocker
>             Fix For: 1.6.0
>
> We added the unhandledFilters interface in SPARK-10978. So a data source has
> a chance to let Spark SQL know that, for those returned filters, it is
> possible that the data source will not apply them to every row. So Spark SQL
> should use a Filter operator to evaluate those filters. However, if a filter
> is part of the returned unhandledFilters, we should still push it down. For
> example, our internal data sources do not override this method; if we do not
> push down those filters, we are actually turning off the filter pushdown
> feature.
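The contract being discussed can be sketched without Spark: the planner pushes every translated filter down to the source, and keeps a Filter operator only for the ones the source reports back as unhandled. The types below are simplified stand-ins for org.apache.spark.sql.sources.Filter, and the "equality-only" source is hypothetical.

```scala
// Simplified model of the SPARK-11661 behavior: push ALL translated filters
// down, but re-evaluate the ones the source says it cannot fully handle.
sealed trait SourceFilter
case class EqualTo(attribute: String, value: Any) extends SourceFilter
case class GreaterThan(attribute: String, value: Any) extends SourceFilter

// A hypothetical source that only evaluates equality predicates natively:
def unhandledFilters(filters: Seq[SourceFilter]): Seq[SourceFilter] =
  filters.filterNot(_.isInstanceOf[EqualTo])

val translated = Seq(EqualTo("a", 1), GreaterThan("b", 2))
val pushed  = translated                    // everything still goes to the source
val recheck = unhandledFilters(translated)  // kept in a Filter operator by Spark SQL
// recheck == Seq(GreaterThan("b", 2))
```

This split is also why the explain output discussed in SPARK-12639 is confusing: the pushed list contains every translated filter, not just the ones the source fully handles.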
[jira] [Updated] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
[ https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Russell Alexander Spitzer updated SPARK-12639:
----------------------------------------------
    Description: 
SPARK-11661 improves handling of predicate pushdowns but has an unintended consequence of making the explain string more confusing. It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle). This can have a confusing effect (I kept checking my code to see where I had broken something).

{code:title="Query plan for source where nothing is handled by C* Source"}
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}

Although the telltale "Filter" step is present, my first instinct would tell me that the underlying source relation is using all of those filters.

{code:title="Query plan for source where everything is handled by C* Source"}
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}

I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source.

Something like
{code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
HandledFilters: []
{code}

  was:
SPARK-11661 improves handling of predicate pushdowns but has an unintended consequence of making the explain string more confusing. It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle). This can have a confusing effect (I kept checking my code to see where I had broken something).

"Query plan for source where nothing is handled by C* Source"
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]

Although the telltale "Filter" step is present, my first instinct would tell me that the underlying source relation is using all of those filters.

"Query plan for source where everything is handled by C* Source"
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]

I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source.

> Improve Explain for DataSources with Handled Predicate Pushdowns
> ----------------------------------------------------------------
>
>                 Key: SPARK-12639
>                 URL: https://issues.apache.org/jira/browse/SPARK-12639
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: Russell Alexander Spitzer
>            Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended
> consequence of making the explain string more confusing. It basically makes
> it seem as if a source is always pushing down all of the filters (even those
> it cannot handle). This can have a confusing effect (I kept checking my code
> to see where I had broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the telltale "Filter" step is present, my first instinct would tell
> me that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
> HandledFilters: []
> {code}
[jira] [Created] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns
Russell Alexander Spitzer created SPARK-12639:
-------------------------------------------------

             Summary: Improve Explain for DataSources with Handled Predicate Pushdowns
                 Key: SPARK-12639
                 URL: https://issues.apache.org/jira/browse/SPARK-12639
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.6.0
            Reporter: Russell Alexander Spitzer
            Priority: Minor

SPARK-11661 improves handling of predicate pushdowns but has an unintended consequence of making the explain string more confusing. It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle). This can have a confusing effect (I kept checking my code to see where I had broken something).

"Query plan for source where nothing is handled by C* Source"
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]

Although the telltale "Filter" step is present, my first instinct would tell me that the underlying source relation is using all of those filters.

"Query plan for source where everything is handled by C* Source"
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]

I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source.
[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters
[ https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075690#comment-15075690 ] Russell Alexander Spitzer commented on SPARK-11661: --- This seems to have a slightly unintended consequence in the explain dialogue. It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle) This can have a confusing effect (I kept checking my code to see where I had broken something :D ) {code: Title="Query plan for source where nothing is handled by C* Source"} Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} Although the tell tale "Filter" step is present my first instinct would tell me that the underlying source relation is using all of those filters. {code: Title="Query plan for source where *everything* is handled by C* Source"} Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source. wdyt? > We should still pushdown filters returned by a data source's unhandledFilters > - > > Key: SPARK-11661 > URL: https://issues.apache.org/jira/browse/SPARK-11661 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0 > > > We added unhandledFilters interface to SPARK-10978. So, a data source has a > chance to let Spark SQL know that for those returned filters, it is possible > that the data source will not apply them to every row. So, Spark SQL should > use a Filter operator to evaluate those filters. 
However, if a filter is a > part of returned unhandledFilters, we should still push it down. For example, > our internal data sources do not override this method, if we do not push down > those filters, we are actually turning off the filter pushdown feature. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters
[ https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075690#comment-15075690 ] Russell Alexander Spitzer edited comment on SPARK-11661 at 12/31/15 3:44 AM: - This seems to have a slightly unintended consequence in the explain dialogue. It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle) This can have a confusing effect (I kept checking my code to see where I had broken something :D ) {code: title="Query plan for source where nothing is handled by C* Source"} Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} Although the tell tale "Filter" step is present my first instinct would tell me that the underlying source relation is using all of those filters. {code: title="Query plan for source where *everything* is handled by C* Source"} Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source. wdyt? was (Author: rspitzer): This seems to have a slightly unintended consequence in the explain dialogue. 
It basically makes it seem as if a source is always pushing down all of the filters (even those it cannot handle) This can have a confusing effect (I kept checking my code to see where I had broken something :D ) {code: Title="Query plan for source where nothing is handled by C* Source"} Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1)) +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} Although the tell tale "Filter" step is present my first instinct would tell me that the underlying source relation is using all of those filters. {code: Title="Query plan for source where *everything* is handled by C* Source"} Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)] {code} I think this would be much clearer if we changed the metadata key to "HandledFilters" and only listed those handled fully by the underlying source. wdyt? > We should still pushdown filters returned by a data source's unhandledFilters > - > > Key: SPARK-11661 > URL: https://issues.apache.org/jira/browse/SPARK-11661 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > Fix For: 1.6.0 > > > We added unhandledFilters interface to SPARK-10978. So, a data source has a > chance to let Spark SQL know that for those returned filters, it is possible > that the data source will not apply them to every row. So, Spark SQL should > use a Filter operator to evaluate those filters. However, if a filter is a > part of returned unhandledFilters, we should still push it down. For example, > our internal data sources do not override this method, if we do not push down > those filters, we are actually turning off the filter pushdown feature. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10104) Consolidate different forms of table identifiers
[ https://issues.apache.org/jira/browse/SPARK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009923#comment-15009923 ] Russell Alexander Spitzer commented on SPARK-10104: --- Yep, see https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraCatalog.scala This is for the use case where someone has two separate Cassandra clusters; it would allow an end user to join data between them. Having the extra Strings would solve the problem for us. I'll let [~alexliu68] fill in any details. > Consolidate different forms of table identifiers > > > Key: SPARK-10104 > URL: https://issues.apache.org/jira/browse/SPARK-10104 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Wenchen Fan > Fix For: 1.6.0 > > > Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to > represent table identifiers. We should only have one form, and TableIdentifier > looks like the best one because it provides methods to get the table > name and database name, and to return the unquoted or quoted string. > There will be TODOs with "SPARK-10104" in them. Those places need to be > updated.
[jira] [Commented] (SPARK-10104) Consolidate different forms of table identifiers
[ https://issues.apache.org/jira/browse/SPARK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15009864#comment-15009864 ] Russell Alexander Spitzer commented on SPARK-10104: --- Previously, with Seq[String], it was possible to extend the data within a TableIdentifier. This was used to allow the specification of a "Cluster", "Database" and Table, which made it possible for us to segregate identifiers based on which cluster they were being read from. Currently we are trying to determine a workaround to store extra information in the TableIdentifier. [~alexliu68]
[jira] [Closed] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Alexander Spitzer closed SPARK-11415. - Resolution: Not A Problem > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause off-by-one errors and could cause significant shifts in data if > the underlying data is worked on in timezones other than UTC.
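The shift in the quoted {{millisToDays}} can be reproduced without Spark. Below is a minimal, self-contained Java sketch (class and method names are mine, not Spark's) that mirrors the quoted logic, with the time zone made an explicit parameter so the behavior is visible regardless of where it runs:

```java
import java.util.TimeZone;

public class MillisToDaysDemo {
    static final long MILLIS_PER_DAY = 86400000L;

    // Mirrors the quoted millisToDays: shift UTC millis into the given
    // zone's local time, then floor-divide by a day's worth of millis.
    static int millisToDays(long millisUtc, TimeZone tz) {
        long millisLocal = millisUtc + tz.getOffset(millisUtc);
        return (int) Math.floor((double) millisLocal / MILLIS_PER_DAY);
    }

    public static void main(String[] args) {
        // Epoch itself: 1970-01-01T00:00:00Z. In UTC this is day 0, but in a
        // zone west of Greenwich the offset is negative, so the floor pushes
        // the result back to day -1 -- the one-day shift described above.
        System.out.println(millisToDays(0L, TimeZone.getTimeZone("UTC")));                 // 0
        System.out.println(millisToDays(0L, TimeZone.getTimeZone("America/Los_Angeles"))); // -1
    }
}
```

The same input millis thus maps to different day numbers depending on the JVM's default zone, which is exactly why the type tests were inconsistent.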
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995934#comment-14995934 ] Russell Alexander Spitzer commented on SPARK-11415: --- I re-read a bunch of the java.sql.Date docs and now I think that the current code is actually ok based on the JavaDocs. I'll just have to make sure that we don't use the java.sql.Date(long timestamp) constructor, since that doesn't do the timezone wrapping that the valueOf method does. I find this behavior a bit odd, but it seems this is a long-known oddity of java.util.Date vs java.sql.Date.
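The {{valueOf}} vs long-constructor difference mentioned above can be shown in a few lines. A minimal sketch (the class name is mine); the default zone is pinned to America/Los_Angeles so the result is reproducible:

```java
import java.sql.Date;
import java.util.TimeZone;

public class SqlDateDemo {
    public static void main(String[] args) {
        TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"));

        // valueOf parses the string as a date in the *local* zone, so the
        // stored millis are local midnight, i.e. 08:00 UTC here.
        Date fromString = Date.valueOf("1970-01-01");
        System.out.println(fromString.getTime()); // 28800000

        // The long constructor takes raw UTC millis with no adjustment;
        // toString then renders them in the local zone, so epoch prints
        // as the previous day.
        Date fromMillis = new Date(0L);
        System.out.println(fromMillis); // 1969-12-31
    }
}
```

So a {{java.sql.Date}} built from raw UTC millis and one built via {{valueOf}} for the "same" calendar day disagree by the zone offset, which is the timezone wrapping referred to above.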
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14995915#comment-14995915 ] Russell Alexander Spitzer commented on SPARK-11415: --- Looks like there are still a lot of tests that need to be cleaned up; I'll take a look at this tomorrow.
[jira] [Commented] (SPARK-11326) Split networking in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14992455#comment-14992455 ] Russell Alexander Spitzer commented on SPARK-11326: --- I'm not sure I see where standalone mode has ever not been focused on reaching feature parity with the other cluster management tools. Even recently, https://issues.apache.org/jira/browse/SPARK-4751 added the dynamic scaling ability which was previously YARN- and Mesos-only. I've always seen standalone mode as a cluster manager for those users who don't want to run a ZooKeeper-maintained cluster. > Split networking in standalone mode > --- > > Key: SPARK-11326 > URL: https://issues.apache.org/jira/browse/SPARK-11326 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jacek Lewandowski > > h3.The idea > Currently, in standalone mode, all components, for all network connections, > need to use the same secure token if they want to have any security ensured. > This ticket is intended to split the communication in standalone mode to make > it more like in YARN mode - application internal communication, scheduler > internal communication and communication between the client and scheduler. > Such refactoring will allow the scheduler (master, workers) to use a > distinct secret, which will remain unknown to the users. Similarly, it will > allow for better security in applications, because each application will be > able to use a distinct secret as well. > By providing Kerberos based SASL authentication/encryption for connections > between a client (Client or AppClient) and Spark Master, it will be possible > to introduce authentication and automatic generation of digest tokens and > safely share them among the application processes. 
> h3.User facing changes when running application > h4.General principles: > - conf: {{spark.authenticate.secret}} is *never sent* over the wire > - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire > - In all situations env variable will overwrite conf variable if present. > - In all situations when a user has to pass secret, it is better (safer) to > do this through env variable > - In work modes with multiple secrets we assume encrypted communication > between client and master, between driver and master, between master and > workers > > h4.Work modes and descriptions > h5.Client mode, single secret > h6.Configuration > - env: {{SPARK_AUTH_SECRET=secret}} or conf: > {{spark.authenticate.secret=secret}} > h6.Description > - The driver is running locally > - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: > {{spark.authenticate.secret}} > - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: > {{spark.authenticate.secret}} for connection to the master > - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it > will look for it in the worker configuration and it will find it there (its > presence is implied). 
> > h5.Client mode, multiple secrets > h6.Configuration > - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: > {{spark.app.authenticate.secret=secret}} > - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: > {{spark.submission.authenticate.secret=scheduler_secret}} > h6.Description > - The driver is running locally > - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: > {{spark.submission.authenticate.secret}} to connect to the master > - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor > conf: {{spark.submission.authenticate.secret}} > - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: > {{spark.app.authenticate.secret}} for communication with the executors > - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so > that the executors can use it to communicate with the driver > - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it > will set it in env: {{SPARK_AUTH_SECRET}} which will be read by > _ExecutorBackend_ afterwards and used for all the connections (with driver, > other executors and external shuffle service). > > h5.Cluster mode, single secret > h6.Configuration > - env: {{SPARK_AUTH_SECRET=secret}} or conf: > {{spark.authenticate.secret=secret}} > h6.Description > - The driver is run by _DriverRunner_ which is a part of the worker > - The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: > {{spark.authenticate.secret}} > - The client will use either env: {{SPARK_AUTH_SECRET}} or conf: > {{spark.authenticate.secret}} for connection to the master and submit the > driver > - _DriverRunner_ will not find any secret in _DriverDescription_ so it will > look for it in the worker configuration and it will find it
[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982001#comment-14982001 ] Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 9:44 AM: - I've been thinking about this for a while, and I think the underlying issue is that the conversion before storing as an Int leads to a lot of strange behaviors. If we are going to have the Date type represent days from epoch, we should most likely throw out all information below that granularity. Adding a test of {code} checkFromToJavaDate(new Date(0)){code} shows the trouble of trying to take the more granular information into account. The date will be converted to some hours before epoch by the timezone magic (if you live in America) and then rounded down to -1. This means it fails the check because {code}[info] "19[69-12-3]1" did not equal "19[70-01-0]1" (DateTimeUtilsSuite.scala:68){code} This is my basic problem with the integration: the operation of transforming a Date to and from a Catalyst Date is only idempotent if the value is created in the local time zone. For all other timezones there is a possibility that the subtraction of the local timezone offset could cause a "Floor" call to essentially move the entire date back in time a day. was (Author: rspitzer): I've been thinking about this for a while, and I think the underlying issue is that the conversion before storing as an Int leads to a lot of strange behaviors. If we are going to have the Date type represent days from epoch, we should most likely throw out all information below that granularity. Adding a test of {code} checkFromToJavaDate(new Date(0)){code} shows the trouble of trying to take the more granular information into account. The date will be converted to some hours before epoch by the timezone magic (if you live in America) and then rounded down to -1. 
This means it fails the check because {code}[info] "19[69-12-3]1" did not equal "19[70-01-0]1" (DateTimeUtilsSuite.scala:68){code}
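The non-idempotent round trip described above can be demonstrated outside Spark. A minimal sketch (class and method names are mine) that mirrors the millisToDays/daysToMillis pair with the zone fixed to America/Los_Angeles:

```java
import java.util.TimeZone;

public class RoundTripDemo {
    static final long MILLIS_PER_DAY = 86400000L;
    static final TimeZone TZ = TimeZone.getTimeZone("America/Los_Angeles");

    // Forward conversion: shift into local time, floor to whole days.
    static int millisToDays(long millisUtc) {
        long millisLocal = millisUtc + TZ.getOffset(millisUtc);
        return (int) Math.floor((double) millisLocal / MILLIS_PER_DAY);
    }

    // Inverse conversion: scale back to millis, shift out of local time.
    static long daysToMillis(int days) {
        long millisUtc = (long) days * MILLIS_PER_DAY;
        return millisUtc - TZ.getOffset(millisUtc);
    }

    public static void main(String[] args) {
        // A Date created at epoch (UTC midnight) does not survive the round
        // trip when the zone is west of Greenwich: the floor loses a day,
        // and the inverse cannot recover the original millis.
        int days = millisToDays(0L);    // -1, not 0
        long back = daysToMillis(days); // -57600000, not 0
        System.out.println(days + " " + back);
    }
}
```

Only inputs already aligned to local midnight round-trip cleanly; anything aligned to UTC midnight in a non-UTC zone comes back a day earlier.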
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982240#comment-14982240 ] Russell Alexander Spitzer commented on SPARK-11415: --- I think the underlying conflict here is that {{java.sql.Date}} has no timezone information. This means it is up to the end user to properly align their {{Date}} object with UTC. This makes sense to me because otherwise we end up with a very difficult situation where only {{Date}} objects which match the locale of the executor will be correctly aligned. So, for example, if I try to create a {{Catalyst.DateType}} with a {{java.sql.Date}} that is in UTC but I'm running in PDT, the value will be aligned incorrectly (and also be returned as an incorrect {{java.sql.Date}}, since the time since epoch will be wrong). This requires an outside source (or user) to reformat their {{Date}}s depending on the location of the Spark cluster (or the configuration of its locale) if they don't want their dates to be corrupted. In C* the driver has a {{LocalDate}} class to avoid this problem, so it is extremely clear when a specific {{Year,Month,Day}} tuple should be aligned against UTC. It also may be helpful to allow for a direct translation of {{Int}} -> {{Catalyst.DateType}} -> {{Int}} for those sources that can provide days from epoch.
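For completeness, aligning a {{java.sql.Date}} with UTC as suggested above can be done with a UTC {{Calendar}}. A minimal sketch (the class and helper name are mine, not from Spark or the connector):

```java
import java.sql.Date;
import java.util.Calendar;
import java.util.TimeZone;

public class UtcAlignedDate {
    // Builds a java.sql.Date whose millis are midnight UTC of the given
    // (year, month, day), regardless of the JVM's default time zone.
    static Date utcDate(int year, int month, int day) {
        Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
        cal.clear();                   // zero out time-of-day fields
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return new Date(cal.getTimeInMillis());
    }

    public static void main(String[] args) {
        System.out.println(utcDate(1970, 1, 1).getTime()); // 0
    }
}
```

A date built this way carries exactly whole-day millis since epoch, which is the alignment a days-from-epoch representation implicitly expects.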
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982024#comment-14982024 ] Russell Alexander Spitzer commented on SPARK-11415: --- I added another commit to fix up the tests. The test code will now run identically no matter what time zone it happens to be run in (even UTC).
[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982024#comment-14982024 ] Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 7:14 AM: - I added another commit to fix up the tests. The test code will now run identically no matter what time zone it happens to be run in (even PDT). was (Author: rspitzer): I added another commit to fix up the tests. The test code will now run identically no matter what time zone it happens to be run in (even UTC).
[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982001#comment-14982001 ] Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:46 AM: - I've been thinking about this for a while, and I think the underlying issue is that the conversion before storing as an Int leads to a lot of strange behaviors. If we are going to have the Date type represent days from epoch we should most likely throw out all information outside of the granularity. Adding a test of {code} checkFromToJavaDate(new Date(0)){code} Shows the trouble of trying to take into account the more granular information The date will be converted to some hours before epoch by the timezone magic (if you live in america) then rounded down to -1. This means it fails the check because {code}[info] "19[69-12-3]1" did not equal "19[70-01-0]1" (DateTimeUtilsSuite.scala:68){code} was (Author: rspitzer): I've been thinking about this for a while, and I think the underlying issue is that the conversion before storing as an Int leads to a lot of strange behaviors. If we are going to have the Date type represent days from epoch we should most likely throw out all information outside of the granularity. Adding a test of {code} checkFromToJavaDate(new Date(0)){code} Shows the trouble of trying to take into account the more granular information The date will be converted to some hours before epoch by the timezone magic (if you live in america) then rounded down to -1. 
This means it fails the check because [info] "19[69-12-3]1" did not equal "19[70-01-0]1" (DateTimeUtilsSuite.scala:68) > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
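To make the described shift concrete, here is a small self-contained Java sketch that emulates the two quoted functions, with the timezone passed explicitly instead of read from a thread-local. The class and method signatures are illustrative stand-ins, not Spark's actual code:

```java
import java.util.TimeZone;

public class DateShiftSketch {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Emulates the quoted millisToDays: shift into local time, then floor-divide.
    static int millisToDays(long millisUtc, TimeZone tz) {
        long millisLocal = millisUtc + tz.getOffset(millisUtc);
        return (int) Math.floor((double) millisLocal / MILLIS_PER_DAY);
    }

    // Emulates the quoted daysToMillis: the inverse shift.
    static long daysToMillis(int days, TimeZone tz) {
        long millisUtc = (long) days * MILLIS_PER_DAY;
        return millisUtc - tz.getOffset(millisUtc);
    }

    public static void main(String[] args) {
        TimeZone pst = TimeZone.getTimeZone("America/Los_Angeles"); // UTC-8 at epoch
        // new Date(0) is exactly the epoch, i.e. 1970-01-01 in UTC, but the
        // local-time shift moves it 8 hours before epoch, and Math.floor then
        // rounds it down to day -1 (1969-12-31) -- the failure described above.
        int day = millisToDays(0L, pst);
        System.out.println(day);                    // -1
        // The round trip does not return to 0 either: the instant has moved.
        System.out.println(daysToMillis(day, pst)); // -57600000
    }
}
```

Running the same arithmetic in a UTC-positive zone rounds the other way, which is why the results depend on where the JVM happens to run.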
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982001#comment-14982001 ] Russell Alexander Spitzer commented on SPARK-11415: --- I've been thinking about this for a while, and I think the underlying issue is that the conversion before storing as an Int leads to a lot of strange behaviors. If we are going to have the Date type represent days from epoch, we should most likely throw out all information below that granularity. Adding a test of {code}checkFromToJavaDate(new Date(0)){code} shows the trouble of trying to take the more granular information into account: the date will be converted to some hours before epoch by the timezone magic (if you live in America) and then rounded down to -1. This means it fails the check because [info] "19[69-12-3]1" did not equal "19[70-01-0]1" (DateTimeUtilsSuite.scala:68) > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981968#comment-14981968 ] Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:20 AM: - Some tests are now broken; investigating. The fix in SPARK-6785 seems to be off to me: in it, 1 second before epoch and 1 second after epoch are 1 day apart. This should not be true; they should both be equivalently far (in days) from epoch 0. Actually I'm not sure about this now... was (Author: rspitzer): Some tests are now, broken investigating The fix in SPARK-6785 seems to be off to me In it 1 second before epoch and 1 second after epoch are 1 Day apart. This should not be true. They should both be equivelently far (in days) from epoch 0 > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981968#comment-14981968 ] Russell Alexander Spitzer commented on SPARK-11415: --- Some tests are now broken; investigating. The fix in SPARK-6785 seems to be off to me: in it, 1 second before epoch and 1 second after epoch are 1 day apart. This should not be true; they should both be equivalently far (in days) from epoch 0. > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Alexander Spitzer updated SPARK-11415: -- Affects Version/s: 1.5.1 > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
[ https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14981868#comment-14981868 ] Russell Alexander Spitzer commented on SPARK-11415: --- https://github.com/apache/spark/pull/9369/files PR Submitted for fix > Catalyst DateType Shifts Input Data by Local Timezone > - > > Key: SPARK-11415 > URL: https://issues.apache.org/jira/browse/SPARK-11415 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Russell Alexander Spitzer > > I've been running type tests for the Spark Cassandra Connector and couldn't > get a consistent result for java.sql.Date. I investigated and noticed the > following code is used to create Catalyst.DateTypes > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 > {code} > /** >* Returns the number of days since epoch from from java.sql.Date. >*/ > def fromJavaDate(date: Date): SQLDate = { > millisToDays(date.getTime) > } > {code} > But millisToDays does not abide by this contract, shifting the underlying > timestamp to the local timezone before calculating the days from epoch. This > causes the invocation to move the actual date around. 
> {code} > // we should use the exact day as Int, for example, (year, month, day) -> > day > def millisToDays(millisUtc: Long): SQLDate = { > // SPARK-6785: use Math.floor so negative number of days (dates before > 1970) > // will correctly work as input for function toJavaDate(Int) > val millisLocal = millisUtc + > threadLocalLocalTimeZone.get().getOffset(millisUtc) > Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt > } > {code} > The inverse function also incorrectly shifts the timezone > {code} > // reverse of millisToDays > def daysToMillis(days: SQLDate): Long = { > val millisUtc = days.toLong * MILLIS_PER_DAY > millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) > } > {code} > https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 > This will cause 1-off errors and could cause significant shifts in data if > the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone
Russell Alexander Spitzer created SPARK-11415: - Summary: Catalyst DateType Shifts Input Data by Local Timezone Key: SPARK-11415 URL: https://issues.apache.org/jira/browse/SPARK-11415 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.0 Reporter: Russell Alexander Spitzer I've been running type tests for the Spark Cassandra Connector and couldn't get a consistent result for java.sql.Date. I investigated and noticed the following code is used to create Catalyst.DateTypes https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144 {code} /** * Returns the number of days since epoch from from java.sql.Date. */ def fromJavaDate(date: Date): SQLDate = { millisToDays(date.getTime) } {code} But millisToDays does not abide by this contract, shifting the underlying timestamp to the local timezone before calculating the days from epoch. This causes the invocation to move the actual date around. 
{code} // we should use the exact day as Int, for example, (year, month, day) -> day def millisToDays(millisUtc: Long): SQLDate = { // SPARK-6785: use Math.floor so negative number of days (dates before 1970) // will correctly work as input for function toJavaDate(Int) val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc) Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt } {code} The inverse function also incorrectly shifts the timezone {code} // reverse of millisToDays def daysToMillis(days: SQLDate): Long = { val millisUtc = days.toLong * MILLIS_PER_DAY millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc) } {code} https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93 This will cause 1-off errors and could cause significant shifts in data if the underlying data is worked on in different timezones than UTC. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation
Russell Alexander Spitzer created SPARK-10978: - Summary: Allow PrunedFilterScan to eliminate predicates from further evaluation Key: SPARK-10978 URL: https://issues.apache.org/jira/browse/SPARK-10978 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.5.0, 1.4.0, 1.3.0 Reporter: Russell Alexander Spitzer Fix For: 1.6.0 Currently PrunedFilterScan allows implementors to push down predicates to an underlying datasource. This is done solely as an optimization, as the predicate will be reapplied on the Spark side as well. This allows for Bloom-filter-like operations but ends up doing a redundant scan for those sources which can do accurate pushdowns. In addition, it makes it difficult for underlying sources to accept queries which reference non-existent columns in order to provide ancillary functionality. In our case we allow a Solr query to be passed in via a non-existent solr_query column. Since this column is not returned, when Spark does a filter on "solr_query" nothing passes. Suggestion on the ML from [~marmbrus] {quote} We have to try and maintain binary compatibility here, so probably the easiest thing to do here would be to add a method to the class. Perhaps something like: def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters By default, this could return all filters so behavior would remain the same, but specific implementations could override it. There is still a chance that this would conflict with existing methods, but hopefully that would not be a problem in practice. {quote}
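The suggestion quoted above can be sketched as follows. This is a hedged illustration in Java: the `Filter` types and scan class below are simplified stand-ins invented for this sketch, not Spark's actual data source API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for Spark SQL's filter hierarchy (not the real classes).
interface Filter {}

class EqualTo implements Filter {
    final String attribute;
    final Object value;
    EqualTo(String attribute, Object value) { this.attribute = attribute; this.value = value; }
}

class GreaterThan implements Filter {
    final String attribute;
    final Object value;
    GreaterThan(String attribute, Object value) { this.attribute = attribute; this.value = value; }
}

// The suggested extension: the scan reports which pushed filters it could NOT
// evaluate exactly, and only those are re-applied on the Spark side.
abstract class PrunedFilteredScanSketch {
    // Default: report every filter as unhandled, so Spark re-evaluates all of
    // them -- preserving the existing, binary-compatible behavior.
    public Filter[] unhandledFilters(Filter[] filters) {
        return filters;
    }
}

// A hypothetical source (e.g. one backing a pseudo-column like solr_query)
// that evaluates equality predicates exactly and asks Spark not to re-check them.
class ExactPushdownScan extends PrunedFilteredScanSketch {
    @Override
    public Filter[] unhandledFilters(Filter[] filters) {
        List<Filter> leftover = new ArrayList<>();
        for (Filter f : filters) {
            if (!(f instanceof EqualTo)) {
                leftover.add(f); // anything else still needs Spark-side evaluation
            }
        }
        return leftover.toArray(new Filter[0]);
    }
}
```

Because the default returns every filter, a source that overrides nothing behaves exactly as before, which is the binary-compatibility property the quoted suggestion is after.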
[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14937471#comment-14937471 ] Russell Alexander Spitzer commented on SPARK-9472: -- Yeah, that's the workaround we recommend now, but that requires every application to manually specify the files. We just want our distribution to not require that much manual intervention (normally we automatically pass in the required hadoop conf based on the user's security setup via spark.hadoop). > Consistent hadoop config for streaming > -- > > Key: SPARK-9472 > URL: https://issues.apache.org/jira/browse/SPARK-9472 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Cody Koeninger >Assignee: Cody Koeninger >Priority: Minor > Fix For: 1.5.0 > >
[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming
[ https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936293#comment-14936293 ] Russell Alexander Spitzer commented on SPARK-9472: -- Any thoughts on back-porting this to previous lines? We just hit this trying to pass in hadoop variables via "spark.hadoop" and having them not register with the Streaming Context via getOrCreate. > Consistent hadoop config for streaming > -- > > Key: SPARK-9472 > URL: https://issues.apache.org/jira/browse/SPARK-9472 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Cody Koeninger >Assignee: Cody Koeninger >Priority: Minor > Fix For: 1.5.0 > >
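The "spark.hadoop" convention mentioned in these comments is a prefix-stripping scheme: keys prefixed with `spark.hadoop.` in the Spark configuration are meant to be copied, minus the prefix, into the Hadoop Configuration. The Java sketch below is a toy model of that idea, using plain maps rather than Spark's or Hadoop's actual configuration classes:

```java
import java.util.HashMap;
import java.util.Map;

public class HadoopConfPropagation {
    static final String PREFIX = "spark.hadoop.";

    // Models (in spirit) how "spark.hadoop.*" entries flow from the Spark
    // conf into a Hadoop Configuration: strip the prefix, drop everything else.
    static Map<String, String> toHadoopConf(Map<String, String> sparkConf) {
        Map<String, String> hadoopConf = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(PREFIX)) {
                hadoopConf.put(e.getKey().substring(PREFIX.length()), e.getValue());
            }
        }
        return hadoopConf;
    }
}
```

The bug discussed in this thread is that, before the fix, `StreamingContext.getOrCreate` built its Hadoop Configuration without performing this propagation step, so such entries were silently ignored.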
[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison
[ https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574983#comment-14574983 ] Russell Alexander Spitzer commented on SPARK-6987: -- Or being able to specify an identifier for each spark worker that wasn't dependent on ip? > Node Locality is determined with String Matching instead of Inet Comparison > --- > > Key: SPARK-6987 > URL: https://issues.apache.org/jira/browse/SPARK-6987 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Russell Alexander Spitzer > > When determining whether or not a task can be run NodeLocal the > TaskSetManager ends up using a direct string comparison between the > preferredIp and the executor's bound interface. > https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880 > https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490 > This means that the preferredIp must be a direct string match of the ip the > the worker is bound to. This means that apis which are gathering data from > other distributed sources must develop their own mapping between the > interfaces bound (or exposed) by the external sources and the interface bound > by the Spark executor since these may be different. > For example, Cassandra exposes a broadcast rpc address which doesn't have to > match the address which the service is bound to. This means when adding > preferredLocation data we must add both the rpc and the listen address to > ensure that we can get a string match (and of course we are out of luck if > Spark has been bound on to another interface). 
[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison
[ https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572218#comment-14572218 ] Russell Alexander Spitzer commented on SPARK-6987: -- For Inet comparison I think the best thing you can do is compare resolved hostnames, but even that isn't really great. I've been looking into other solutions but haven't found anything really satisfactory. Having each worker/executor list all interfaces could be useful; that way any service, as long as it is bound to a real interface on the machine, could be properly matched. > Node Locality is determined with String Matching instead of Inet Comparison > --- > > Key: SPARK-6987 > URL: https://issues.apache.org/jira/browse/SPARK-6987 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Russell Alexander Spitzer > > When determining whether or not a task can be run NodeLocal the > TaskSetManager ends up using a direct string comparison between the > preferredIp and the executor's bound interface. > https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880 > https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490 > This means that the preferredIp must be a direct string match of the ip the > worker is bound to. This means that apis which are gathering data from > other distributed sources must develop their own mapping between the > interfaces bound (or exposed) by the external sources and the interface bound > by the Spark executor since these may be different. > For example, Cassandra exposes a broadcast rpc address which doesn't have to > match the address which the service is bound to. 
This means when adding > preferredLocation data we must add both the rpc and the listen address to > ensure that we can get a string match (and of course we are out of luck if > Spark has been bound on to another interface).
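The gap between string matching and address comparison can be shown with two textual spellings of the same address. This Java sketch is only an illustration of the problem being discussed, not Spark's scheduler code; note that even an address-level comparison would not fix the Cassandra broadcast-vs-listen case above, where the two addresses genuinely differ.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalityMatch {
    // The scheduler's check as described above: a plain string comparison
    // between the preferred location and the executor's bound interface.
    static boolean stringMatch(String preferredIp, String executorHost) {
        return preferredIp.equals(executorHost);
    }

    // An Inet-aware comparison. getByName only parses here because both
    // arguments are literal addresses; with hostnames it would hit DNS,
    // which is part of why this is not a complete fix either.
    static boolean inetMatch(String a, String b) {
        try {
            return InetAddress.getByName(a).equals(InetAddress.getByName(b));
        } catch (UnknownHostException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two spellings of the same IPv6 loopback address: the string check
        // misses the NodeLocal match, the address check finds it.
        System.out.println(stringMatch("::1", "0:0:0:0:0:0:0:1")); // false
        System.out.println(inetMatch("::1", "0:0:0:0:0:0:0:1"));   // true
    }
}
```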
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524372#comment-14524372 ] Russell Alexander Spitzer commented on SPARK-6069: -- Running with --conf spark.files.userClassPathFirst=true yields a different error {code} scala> cc.sql("SELECT * FROM test.fun as a JOIN test.fun as b ON (a.k = b.v)").collect 15/05/01 17:24:34 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.lang.NoClassDefFoundError: org/apache/spark/Partition at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:800) at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) at java.net.URLClassLoader.access$100(URLClassLoader.java:71) at java.net.URLClassLoader$1.run(URLClassLoader.java:361) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42) at 
org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:412) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30) at org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57) at org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ClassNotFoundException: org.apache.spark.Partition at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at org.apache.spark.executor.ChildExecutorURLClassLoade
[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14
[ https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524363#comment-14524363 ] Russell Alexander Spitzer commented on SPARK-6069:
--
We've seen the same issue while developing the Spark Cassandra Connector. Unless the connector lib is loaded via spark.executor.extraClassPath, Kryo serialization for joins always fails with a ClassNotFoundException, even though all operations which don't require a shuffle are fine.
{code}
com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.spark.sql.cassandra.CassandraSQLRow
	at com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
	at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
	at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
	at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
	at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
	at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
	at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
	at org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:80)
	at org.apache.spark.sql.execution.joins.ShuffledHashJoin$$anonfun$execute$1.apply(ShuffledHashJoin.scala:46)
	at org.apache.spark.sql.execution.joins.ShuffledHashJoin$$anonfun$execute$1.apply(ShuffledHashJoin.scala:45)
	at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
	at org.apache.spark.scheduler.Task.run(Task.scala:56)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
Adding the jar to executorExtraClasspath rather than --jars solves the issue.
> Deserialization Error ClassNotFoundException with Kryo, Guava 14
>
>         Key: SPARK-6069
>         URL: https://issues.apache.org/jira/browse/SPARK-6069
>     Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Affects Versions: 1.2.1
> Environment: Standalone one worker cluster on localhost, or any cluster
>    Reporter: Pat Ferrel
>    Priority: Critical
>
> A class is contained in the jars passed in when creating a context. It is registered with kryo. The class (Guava HashBiMap) is created correctly from an RDD and broadcast but the deserialization fails with ClassNotFound.
> The work around is to hard code the path to the jar and make it available on all workers.
> Hard code because we are creating a library so there is no easy
> way to pass in to the app something like:
> spark.executor.extraClassPath /path/to/some.jar
--
This message was sent by Atlassian JIRA (v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
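The workaround described in the comment above can be sketched as a SparkConf fragment. This is a hedged illustration, not code from the issue: the app name is made up, and the jar path reuses the `/path/to/some.jar` placeholder from the report and must point to a real jar present on every worker.

```scala
// Sketch of the workaround: put the jar on the executors' root classpath so
// Kryo can resolve the class during shuffle deserialization. Jars shipped
// with --jars are loaded by a child classloader that Kryo's resolver may not
// consult; spark.executor.extraClassPath loads them at executor JVM startup.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-classpath-workaround") // hypothetical app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // The jar must already exist at this path on each worker node.
  .set("spark.executor.extraClassPath", "/path/to/some.jar")
val sc = new SparkContext(conf)
```

The trade-off the reporter complains about is visible here: unlike --jars, extraClassPath does not distribute the jar, so a library author cannot set it on the user's behalf without knowing an on-disk path valid on every worker.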
[jira] [Commented] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14508170#comment-14508170 ] Russell Alexander Spitzer commented on SPARK-7061:
--
Thanks! I didn't see it come up in a general text search, only the previous one I linked. But that is the issue (SPARK-6299).
> Case Classes Cannot be Repartitioned/Shuffled in Spark REPL
>
>         Key: SPARK-7061
>         URL: https://issues.apache.org/jira/browse/SPARK-7061
>     Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
> Affects Versions: 1.2.1
> Environment: Single Node Stand Alone Spark Shell
>    Reporter: Russell Alexander Spitzer
>    Priority: Minor
>
> Running the following code in the spark shell against a stand alone master.
> {code}
> case class CustomerID( id:Int)
> sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1)
> {code}
> Gives the following exception
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> 	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> 	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> 	at java.lang.Class.forName0(Native Method)
> 	at java.lang.Class.forName(Class.java:274)
> 	at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
> 	at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> 	at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> 	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> 	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
> 	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
> 	at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> 	at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
> 	at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
> 	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
> 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
> 	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> 	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> 	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
> 	at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> 	at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> 	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> 	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> 	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> 	at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> 	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> 	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> 	at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> 	at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
> 	at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> 	at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:
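The `$iwC$$iwC$CustomerID` name in the trace above is the synthetic wrapper the REPL generates around shell-defined classes, which executors cannot resolve during a shuffle. One workaround sketch, under the assumption that the class can be moved out of the shell's wrapper scope: define it inside a named top-level object (or compile it into a jar shipped with --jars). The object name `Model` here is hypothetical, invented for illustration.

```scala
// Sketch of a workaround: nesting the case class in a named object gives it a
// stable binary name (Model$CustomerID) instead of a REPL wrapper name like
// $iwC$$iwC$CustomerID, so executor classloaders can find it.
object Model extends Serializable {
  case class CustomerID(id: Int)
}

// The original repro from the issue would then become (in the shell):
// sc.parallelize(1 to 1000).map(Model.CustomerID(_)).repartition(1).take(1)
```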
[jira] [Updated] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Russell Alexander Spitzer updated SPARK-7061: - Description: Running the following code in the spark shell against a stand alone master. {code} case class CustomerID( id:Int) sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1) {code} Gives the following exception {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098) at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} I believe this is related to the shuffle code, since the following other examples also give this exception. {code} val idsOfInterest = sc.parallelize(1 to 1000).map(CustomerID(_)).groupBy(_.id).take(1) val idsOfInterest = sc.parallelize(1 to 1000).map( x => (CustomerID(x), x)).groupByKey().take(1) val idsOfInterest = sc.parallelize(1 to 1000).map( x => (CustomerID(x), x)).
[jira] [Created] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL
Russell Alexander Spitzer created SPARK-7061: Summary: Case Classes Cannot be Repartitioned/Shuffled in Spark REPL Key: SPARK-7061 URL: https://issues.apache.org/jira/browse/SPARK-7061 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.2.1 Environment: Single Node Stand Alone Spark Shell Reporter: Russell Alexander Spitzer Priority: Minor Running the following code in the spark shell against a stand alone master. {code} case class CustomerID( id:Int) sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1) {code} Gives the following exception {code} org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:274) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098) at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} I believe this is related to the shuffle code since the following other example
[jira] [Created] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison
Russell Alexander Spitzer created SPARK-6987: Summary: Node Locality is determined with String Matching instead of Inet Comparison Key: SPARK-6987 URL: https://issues.apache.org/jira/browse/SPARK-6987 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core Affects Versions: 1.3.0, 1.2.0 Reporter: Russell Alexander Spitzer
When determining whether or not a task can be run NodeLocal, the TaskSetManager ends up using a direct string comparison between the preferredIp and the executor's bound interface.
https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880
https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
This means that the preferredIp must be a direct string match of the IP that the worker is bound to, so APIs which gather data from other distributed sources must develop their own mapping between the interfaces bound (or exposed) by the external sources and the interface bound by the Spark executor, since these may be different. For example, Cassandra exposes a broadcast rpc address which doesn't have to match the address the service is bound to, so when adding preferredLocation data we must add both the rpc and the listen address to ensure that we can get a string match (and of course we are out of luck if Spark has been bound to another interface).
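The difference between the current string match and an Inet-based comparison can be illustrated with plain java.net.InetAddress. This is an illustrative sketch, not scheduler code; the host names below are examples, and sameHost is a hypothetical helper, not a proposed Spark API.

```scala
// Illustration of the SPARK-6987 complaint: a literal string comparison treats
// two spellings of the same host as different locations, while resolving both
// through InetAddress before comparing would recognize them as the same node.
import java.net.InetAddress

val preferredIp = "localhost"    // e.g. locality hint from an external source
val boundInterface = "127.0.0.1" // e.g. the address the executor is bound to

// What the TaskSetManager effectively does today:
val stringMatch = preferredIp == boundInterface // false, locality is lost

// What an Inet comparison could do instead (hypothetical helper):
def sameHost(a: String, b: String): Boolean =
  InetAddress.getByName(a).getHostAddress == InetAddress.getByName(b).getHostAddress
```

Note that resolving addresses on the scheduling path has its own costs (DNS lookups, multi-homed hosts), which is presumably part of why a plain string comparison was used.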