[jira] [Commented] (SPARK-13928) Move org.apache.spark.Logging into org.apache.spark.internal.Logging

2016-06-14 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15330550#comment-15330550
 ] 

Russell Alexander Spitzer commented on SPARK-13928:
---

So users (like me ;) ) need to write their own Logging trait now? I'm a little 
confused by the description. 
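
If the intent really is for users to roll their own, a minimal slf4j-backed trait along these lines keeps downstream code compiling. This is just a sketch, not the Spark-internal implementation:

{code}
import org.slf4j.{Logger, LoggerFactory}

// Sketch of a user-side stand-in for the removed org.apache.spark.Logging trait.
trait Logging {
  // slf4j caches loggers internally, so a simple def lookup is cheap and avoids
  // any serialization concerns on executors.
  protected def log: Logger = LoggerFactory.getLogger(getClass.getName)

  protected def logInfo(msg: => String): Unit = if (log.isInfoEnabled) log.info(msg)
  protected def logWarning(msg: => String): Unit = if (log.isWarnEnabled) log.warn(msg)
  protected def logError(msg: => String): Unit = if (log.isErrorEnabled) log.error(msg)
  protected def logError(msg: => String, e: Throwable): Unit = if (log.isErrorEnabled) log.error(msg, e)
}
{code}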

> Move org.apache.spark.Logging into org.apache.spark.internal.Logging
> 
>
> Key: SPARK-13928
> URL: https://issues.apache.org/jira/browse/SPARK-13928
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Logging was made private in Spark 2.0. If we move it, then users would be 
> able to create a Logging trait themselves to avoid changing their own code. 
> Alternatively, we can also provide a compatibility package that adds 
> logging.






[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2016-06-12 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326793#comment-15326793
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I honestly am super confused about what is supposed to happen vs what isn't. I 
ended up just tuning the connector to be aware of this so at least dates don't 
get corrupted when coming out of C*. 

https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/types/TypeConverter.scala#L314-L318

I blame java.sql.Date for being ... strange.

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>   /**
>    * Returns the number of days since epoch from java.sql.Date.
>    */
>   def fromJavaDate(date: Date): SQLDate = {
>     millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> day
>   def millisToDays(millisUtc: Long): SQLDate = {
>     // SPARK-6785: use Math.floor so negative number of days (dates before 1970)
>     // will correctly work as input for function toJavaDate(Int)
>     val millisLocal = millisUtc + threadLocalLocalTimeZone.get().getOffset(millisUtc)
>     Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
>     val millisUtc = days.toLong * MILLIS_PER_DAY
>     millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause off-by-one errors and could cause significant shifts in data if 
> the underlying data is worked on in timezones other than UTC.
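
For concreteness, here is a small sketch of the shift described above. It assumes the JVM default timezone is behind UTC (America/Los_Angeles is used for illustration); the exact dates depend on the local offset:

{code}
import java.util.TimeZone

// Assume a JVM timezone west of UTC for illustration.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

// Same arithmetic as millisToDays: shift the UTC millis by the local offset before flooring.
def millisToDaysLikeCatalyst(millisUtc: Long): Int = {
  val millisLocal = millisUtc + TimeZone.getDefault.getOffset(millisUtc)
  math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
}

// A value that is exactly UTC midnight of 2015-01-01 (day 16436 since epoch) lands on the
// previous day once the -08:00 offset is applied.
val utcMidnight = 16436L * MILLIS_PER_DAY
println(millisToDaysLikeCatalyst(utcMidnight))  // 16435, i.e. 2014-12-31: off by one
{code}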






[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory

2016-02-25 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15168168#comment-15168168
 ] 

Russell Alexander Spitzer commented on SPARK-11293:
---

Saw a similar issue loading the Friendster graph into Tinkerpop and running the 
BulkVertexLoader with the Shuffled Vertex RDD persisted to disk. 

> Spillable collections leak shuffle memory
> -
>
> Key: SPARK-11293
> URL: https://issues.apache.org/jira/browse/SPARK-11293
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.3.1, 1.4.1, 1.5.1, 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> I discovered multiple leaks of shuffle memory while working on my memory 
> manager consolidation patch, which added the ability to do strict memory leak 
> detection for the bookkeeping that used to be performed by the 
> ShuffleMemoryManager. This uncovered a handful of places where tasks can 
> acquire execution/shuffle memory but never release it, starving themselves of 
> memory.
> Problems that I found:
> * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution 
> memory.
> * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a 
> {{CompletionIterator}}.
> * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing 
> its resources.
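
For readers unfamiliar with the CompletionIterator pattern mentioned above, here is a generic sketch of the idea (not Spark's internal class, whose exact API may differ): wrap the returned iterator so a cleanup action runs exactly once when it is exhausted.

{code}
// Generic sketch of the completion-iterator pattern.
class CompletingIterator[A](sub: Iterator[A], completion: () => Unit) extends Iterator[A] {
  private var completed = false

  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) {
      completed = true
      completion()  // e.g. release the sorter's execution/shuffle memory
    }
    more
  }

  override def next(): A = sub.next()
}

// Hypothetical usage: hand back the sorted iterator and free memory once it is drained.
// val records = new CompletingIterator(sorter.iterator, () => sorter.stop())
{code}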






[jira] [Commented] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5

2016-02-22 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15158213#comment-15158213
 ] 

Russell Alexander Spitzer commented on SPARK-13289:
---

same

> Word2Vec generate infinite distances when numIterations>5
> -
>
> Key: SPARK-13289
> URL: https://issues.apache.org/jira/browse/SPARK-13289
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
> Environment: Linux, Scala
>Reporter: Qi Dai
>  Labels: features
>
> I recently ran some word2vec experiments on a cluster with 50 executors on a 
> large text dataset, but found that when the number of iterations is larger 
> than 5 the distances between words are all infinite. My code looks like this:
> val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" ").toSeq)
> import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
> val word2vec = new Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5)
> val model = word2vec.fit(text)
> val synonyms = model.findSynonyms("who", 40)
> for((synonym, cosineSimilarity) <- synonyms) {
>   println(s"$synonym $cosineSimilarity")
> }
> The results are: 
> to Infinity
> and Infinity
> that Infinity
> with Infinity
> said Infinity
> it Infinity
> by Infinity
> be Infinity
> have Infinity
> he Infinity
> has Infinity
> his Infinity
> an Infinity
> ) Infinity
> not Infinity
> who Infinity
> I Infinity
> had Infinity
> their Infinity
> were Infinity
> they Infinity
> but Infinity
> been Infinity
> I tried many different datasets and different words for finding synonyms.
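
One quick way to check whether the trained vectors themselves are corrupted (rather than just the similarity computation) is to scan the fitted model for non-finite components. A sketch, assuming the {{model}} from the snippet above:

{code}
// Collect words whose vectors contain NaN or infinite components.
val badWords = model.getVectors.collect {
  case (word, vec) if vec.exists(v => v.isNaN || v.isInfinity) => word
}
println(s"${badWords.size} words have non-finite vector components")
{code}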






[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-26 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118221#comment-15118221
 ] 

Russell Alexander Spitzer commented on SPARK-12639:
---

[~yhuai] I made another PR for the * solution: 
https://github.com/apache/spark/pull/10929/files. I'm not sure where this should 
be documented, but if this is what you are looking for I'll find a place to note it :)

> Improve Explain for DataSources with Handled Predicate Pushdowns
> 
>
> Key: SPARK-12639
> URL: https://issues.apache.org/jira/browse/SPARK-12639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended 
> consequence of making the explain string more confusing.
> It basically makes it seem as if a source is always pushing down all of the 
> filters, even those it cannot handle.
> This can have a confusing effect (I kept checking my code to see where I had 
> broken something).
> {code:title="Query plan for source where nothing is handled by C* Source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the tell-tale "Filter" step is present, my first instinct would be 
> that the underlying source relation is using all of those filters.
> {code:title="Query plan for source where everything is handled by C* Source"}
> Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to 
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
> Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] HandledFilters: []
> {code}
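
For context, the handled/unhandled split comes from the {{unhandledFilters}} hook on {{BaseRelation}} (added by SPARK-10978). Below is a rough sketch of a relation that only handles equality predicates, to show what a "HandledFilters" key would report; everything except the Spark API names is made up:

{code}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

// Sketch only: a relation that claims it can evaluate EqualTo predicates itself and
// hands everything else back to Spark for a post-scan Filter step.
abstract class EqualityOnlyRelation extends BaseRelation {
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(_.isInstanceOf[EqualTo])
}

// Under the proposal above, PushedFilters would keep listing everything Spark attempts to
// push, while HandledFilters would list only the predicates the relation did not hand back
// (the EqualTo ones here).
{code}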






[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-26 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15118222#comment-15118222
 ] 

Russell Alexander Spitzer commented on SPARK-12639:
---

I keep forgetting about your bot!

> Improve Explain for DataSources with Handled Predicate Pushdowns
> 
>
> Key: SPARK-12639
> URL: https://issues.apache.org/jira/browse/SPARK-12639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended 
> consequence of making the explain string more confusing.
> It basically makes it seem as if a source is always pushing down all of the 
> filters (even those it cannot handle)
> This can have a confusing effect (I kept checking my code to see where I had 
> broken something  )
> {code: title= "Query plan for source where nothing is handled by C* Source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the tell tale "Filter" step is present my first instinct would tell 
> me that the underlying source relation is using all of those filters.
> {code: title = "Query plan for source where everything is handled by C* 
> Source"}
> Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to 
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code: title="Proposed Explain for Pushdown were none of the predicates are 
> handled by the underlying source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  HandledFilters: []
> {code}






[jira] [Commented] (SPARK-12143) When column type is binary, select occurs ClassCastExcption in Beeline.

2016-01-22 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113337#comment-15113337
 ] 

Russell Alexander Spitzer commented on SPARK-12143:
---

I see this as well using beeline to read from spark-sql-thriftserver on a table 
with a binary column.

{code}
0: jdbc:hive2://localhost:1> describe bios;
++++
|  col_name  | data_type  |  comment   |
++++
| user_name  | string | from deserializer  |
| bio| binary | from deserializer  |
++++
2 rows selected (0.03 seconds)
0: jdbc:hive2://localhost:1> select * from bios;
Error: java.lang.String cannot be cast to [B (state=,code=0)
{code}

> When column type is binary, select occurs ClassCastExcption in Beeline.
> ---
>
> Key: SPARK-12143
> URL: https://issues.apache.org/jira/browse/SPARK-12143
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: meiyoula
>
> In Beeline, execute the SQL below:
> 1. create table bb(bi binary);
> 2. load data inpath 'tmp/data' into table bb;
> 3. select * from bb;
> Error: java.lang.ClassCastException: java.lang.String cannot be cast to [B (state=, code=0)






[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-07 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088537#comment-15088537
 ] 

Russell Alexander Spitzer commented on SPARK-12639:
---

This is a regression of SPARK-11390

> Improve Explain for DataSources with Handled Predicate Pushdowns
> 
>
> Key: SPARK-12639
> URL: https://issues.apache.org/jira/browse/SPARK-12639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended 
> consequence of making the explain string more confusing.
> It basically makes it seem as if a source is always pushing down all of the 
> filters (even those it cannot handle)
> This can have a confusing effect (I kept checking my code to see where I had 
> broken something  )
> {code: title= "Query plan for source where nothing is handled by C* Source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the tell tale "Filter" step is present my first instinct would tell 
> me that the underlying source relation is using all of those filters.
> {code: title = "Query plan for source where everything is handled by C* 
> Source"}
> Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to 
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code: title="Proposed Explain for Pushdown were none of the predicates are 
> handled by the underlying source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  HandledFilters: []
> {code}






[jira] [Commented] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-07 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088558#comment-15088558
 ] 

Russell Alexander Spitzer commented on SPARK-12639:
---

https://github.com/apache/spark/pull/10655 [~yhuai]

> Improve Explain for DataSources with Handled Predicate Pushdowns
> 
>
> Key: SPARK-12639
> URL: https://issues.apache.org/jira/browse/SPARK-12639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended 
> consequence of making the explain string more confusing.
> It basically makes it seem as if a source is always pushing down all of the 
> filters (even those it cannot handle)
> This can have a confusing effect (I kept checking my code to see where I had 
> broken something  )
> {code: title= "Query plan for source where nothing is handled by C* Source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the tell tale "Filter" step is present my first instinct would tell 
> me that the underlying source relation is using all of those filters.
> {code: title = "Query plan for source where everything is handled by C* 
> Source"}
> Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to 
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code: title="Proposed Explain for Pushdown were none of the predicates are 
> handled by the underlying source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  HandledFilters: []
> {code}






[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2016-01-04 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082357#comment-15082357
 ] 

Russell Alexander Spitzer commented on SPARK-11661:
---

Yeah sure, np: https://issues.apache.org/jira/browse/SPARK-12639. I'll work on 
the PR later this week.

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We added unhandledFilters interface to SPARK-10978. So, a data source has a 
> chance to let Spark SQL know that for those returned filters, it is possible 
> that the data source will not apply them to every row. So, Spark SQL should 
> use a Filter operator to evaluate those filters. However, if a filter is a 
> part of returned unhandledFilters, we should still push it down. For example, 
> our internal data sources do not override this method, if we do not push down 
> those filters, we are actually turning off the filter pushdown feature.






[jira] [Updated] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-04 Thread Russell Alexander Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Alexander Spitzer updated SPARK-12639:
--
Description: 
SPARK-11661 improves handling of predicate pushdowns but has an unintended 
consequence of making the explain string more confusing.

It basically makes it seem as if a source is always pushing down all of the 
filters, even those it cannot handle.
This can have a confusing effect (I kept checking my code to see where I had 
broken something).
{code:title="Query plan for source where nothing is handled by C* Source"}
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}
Although the tell-tale "Filter" step is present, my first instinct would be 
that the underlying source relation is using all of those filters.
{code:title="Query plan for source where everything is handled by C* Source"}
Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86] PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}
I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.

Something like
{code:title="Proposed Explain for Pushdown where none of the predicates are handled by the underlying source"}
Filter ((((a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78] HandledFilters: []
{code}



  was:
SPARK-11661 improves handling of predicate pushdowns but has an unintended 
consequence of making the explain string more confusing.

It basically makes it seem as if a source is always pushing down all of the 
filters (even those it cannot handle)
This can have a confusing effect (I kept checking my code to see where I had 
broken something  )
"Query plan for source where nothing is handled by C* Source"
Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
Although the tell tale "Filter" step is present my first instinct would tell me 
that the underlying source relation is using all of those filters.
"Query plan for source where everything is handled by C* Source"
Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.


> Improve Explain for DataSources with Handled Predicate Pushdowns
> 
>
> Key: SPARK-12639
> URL: https://issues.apache.org/jira/browse/SPARK-12639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Russell Alexander Spitzer
>Priority: Minor
>
> SPARK-11661 improves handling of predicate pushdowns but has an unintended 
> consequence of making the explain string more confusing.
> It basically makes it seem as if a source is always pushing down all of the 
> filters (even those it cannot handle)
> This can have a confusing effect (I kept checking my code to see where I had 
> broken something  )
> {code: title= "Query plan for source where nothing is handled by C* Source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> Although the tell tale "Filter" step is present my first instinct would tell 
> me that the underlying source relation is using all of those filters.
> {code: title = "Query plan for source where everything is handled by C* 
> Source"}
> Scan 
> org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
>  PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
> {code}
> I think this would be much clearer if we changed the metadata key to 
> "HandledFilters" and only listed those handled fully by the underlying source.
> Something like
> {code: title="Proposed Explain for Pushdown were none of the predicates are 
> handled by the underlying source"}
> Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
> +- 

[jira] [Created] (SPARK-12639) Improve Explain for DataSources with Handled Predicate Pushdowns

2016-01-04 Thread Russell Alexander Spitzer (JIRA)
Russell Alexander Spitzer created SPARK-12639:
-

 Summary: Improve Explain for DataSources with Handled Predicate 
Pushdowns
 Key: SPARK-12639
 URL: https://issues.apache.org/jira/browse/SPARK-12639
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Russell Alexander Spitzer
Priority: Minor


SPARK-11661 improves handling of predicate pushdowns but has an unintended 
consequence of making the explain string more confusing.

It basically makes it seem as if a source is always pushing down all of the 
filters (even those it cannot handle)
This can have a confusing effect (I kept checking my code to see where I had 
broken something  )
"Query plan for source where nothing is handled by C* Source"
Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
Although the tell tale "Filter" step is present my first instinct would tell me 
that the underlying source relation is using all of those filters.
"Query plan for source where everything is handled by C* Source"
Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.






[jira] [Commented] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-12-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075690#comment-15075690
 ] 

Russell Alexander Spitzer commented on SPARK-11661:
---

This seems to have a slightly unintended consequence in the explain dialogue. 

It basically makes it seem as if a source is always pushing down all of the 
filters (even those it cannot handle)

This can have a confusing effect (I kept checking my code to see where I had 
broken something :D )

{code: Title="Query plan for source where nothing is handled by C* Source"}
Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}
Although the tell tale "Filter" step is present my first instinct would tell me 
that the underlying source relation is using all of those filters.

{code: Title="Query plan for source where *everything* is handled by C* Source"}
Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}


I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.

wdyt?

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We added unhandledFilters interface to SPARK-10978. So, a data source has a 
> chance to let Spark SQL know that for those returned filters, it is possible 
> that the data source will not apply them to every row. So, Spark SQL should 
> use a Filter operator to evaluate those filters. However, if a filter is a 
> part of returned unhandledFilters, we should still push it down. For example, 
> our internal data sources do not override this method, if we do not push down 
> those filters, we are actually turning off the filter pushdown feature.






[jira] [Comment Edited] (SPARK-11661) We should still pushdown filters returned by a data source's unhandledFilters

2015-12-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075690#comment-15075690
 ] 

Russell Alexander Spitzer edited comment on SPARK-11661 at 12/31/15 3:44 AM:
-

This seems to have a slightly unintended consequence in the explain dialogue. 

It basically makes it seem as if a source is always pushing down all of the 
filters (even those it cannot handle)

This can have a confusing effect (I kept checking my code to see where I had 
broken something :D )

{code: title="Query plan for source where nothing is handled by C* Source"}
Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}
Although the tell tale "Filter" step is present my first instinct would tell me 
that the underlying source relation is using all of those filters.

{code: title="Query plan for source where *everything* is handled by C* Source"}
Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}


I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.

wdyt?


was (Author: rspitzer):
This seems to have a slightly unintended consequence in the explain dialogue. 

It basically makes it seem as if a source is always pushing down all of the 
filters (even those it cannot handle)

This can have a confusing effect (I kept checking my code to see where I had 
broken something :D )

{code: Title="Query plan for source where nothing is handled by C* Source"}
Filter a#71 = 1) && (b#72 = 2)) && (c#73 = 1)) && (e#75 = 1))
+- Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@4b9cf75c[a#71,b#72,c#73,d#74,e#75,f#76,g#77,h#78]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}
Although the tell tale "Filter" step is present my first instinct would tell me 
that the underlying source relation is using all of those filters.

{code: Title="Query plan for source where *everything* is handled by C* Source"}
Scan 
org.apache.spark.sql.cassandra.CassandraSourceRelation@55d4456c[a#79,b#80,c#81,d#82,e#83,f#84,g#85,h#86]
 PushedFilters: [EqualTo(a,1), EqualTo(b,2), EqualTo(c,1), EqualTo(e,1)]
{code}


I think this would be much clearer if we changed the metadata key to 
"HandledFilters" and only listed those handled fully by the underlying source.

wdyt?

> We should still pushdown filters returned by a data source's unhandledFilters
> -
>
> Key: SPARK-11661
> URL: https://issues.apache.org/jira/browse/SPARK-11661
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.6.0
>
>
> We added unhandledFilters interface to SPARK-10978. So, a data source has a 
> chance to let Spark SQL know that for those returned filters, it is possible 
> that the data source will not apply them to every row. So, Spark SQL should 
> use a Filter operator to evaluate those filters. However, if a filter is a 
> part of returned unhandledFilters, we should still push it down. For example, 
> our internal data sources do not override this method, if we do not push down 
> those filters, we are actually turning off the filter pushdown feature.






[jira] [Commented] (SPARK-10104) Consolidate different forms of table identifiers

2015-11-17 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009864#comment-15009864
 ] 

Russell Alexander Spitzer commented on SPARK-10104:
---

Previously, with Seq[String], it was possible to extend the data within a 
TableIdentifier. This was used to allow the specification of a "Cluster", 
"Database" and Table, which made it possible for us to segregate identifiers 
based on which cluster they were being read from. 

Currently we are trying to determine a workaround to store extra information in 
the TableIdentifier. [~alexliu68] 
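
Purely as an illustration of the kind of information the connector needs to carry (not a proposal for the Spark API), the identifier looks conceptually like this; the case class name is made up:

{code}
// Hypothetical shape of a cluster-qualified identifier.
// With Seq[String] we could encode Seq(cluster, keyspace, table); the two-field
// TableIdentifier leaves no slot for the cluster.
case class ClusterQualifiedTableRef(
    cluster: Option[String],   // which Cassandra cluster to read from
    keyspace: Option[String],  // maps onto TableIdentifier.database
    table: String)             // maps onto TableIdentifier.table

// Old style: Seq("cluster1", "ks", "users")
// New style: TableIdentifier("users", Some("ks"))  // no place for "cluster1"
{code}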

> Consolidate different forms of table identifiers
> 
>
> Key: SPARK-10104
> URL: https://issues.apache.org/jira/browse/SPARK-10104
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to 
> represent table identifiers. We should only have one form and looks 
> TableIdentifier is the best one because it provides methods to get table 
> name, database name, return unquoted string, and return quoted string. 
> There will be TODOs having "SPARK-10104" in it. Those places need to be 
> updated.






[jira] [Commented] (SPARK-10104) Consolidate different forms of table identifiers

2015-11-17 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15009923#comment-15009923
 ] 

Russell Alexander Spitzer commented on SPARK-10104:
---

Yep, see 
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraCatalog.scala

This is for the use case where someone has two separate Cassandra Clusters. 
This would allow an end user to join data between them. Having the extra 
Strings would solve the problem for us. I'll let [~alexliu68] fill in any 
details. 

> Consolidate different forms of table identifiers
> 
>
> Key: SPARK-10104
> URL: https://issues.apache.org/jira/browse/SPARK-10104
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>
> Right now, we have QualifiedTableName, TableIdentifier, and Seq[String] to 
> represent table identifiers. We should only have one form and looks 
> TableIdentifier is the best one because it provides methods to get table 
> name, database name, return unquoted string, and return quoted string. 
> There will be TODOs having "SPARK-10104" in it. Those places need to be 
> updated.






[jira] [Closed] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-11-08 Thread Russell Alexander Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Alexander Spitzer closed SPARK-11415.
-
Resolution: Not A Problem

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.






[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-11-08 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995915#comment-14995915
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

Looks like there are still a lot of tests that need to be cleaned up; I'll take 
a look at this tomorrow.

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.






[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-11-08 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14995934#comment-14995934
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I re-read a bunch of the java.sql.Date docs and now I think the current code is 
actually OK based on the JavaDocs. I'll just have to make sure we don't use the 
java.sql.Date(long timestamp) constructor, since that doesn't do the timezone 
wrapping that the valueOf method does. 

I find this behavior a bit odd, but it seems to be a long-known oddity of 
java.util.Date vs. java.sql.Date.
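
A tiny sketch of the difference, assuming a JVM default timezone behind UTC (America/Los_Angeles here): valueOf builds local midnight for the given calendar date, while the long constructor keeps the millis exactly as passed.

{code}
import java.sql.Date
import java.util.TimeZone

// Assume UTC-8 for illustration.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

val utcMidnight = 16436L * 24 * 60 * 60 * 1000  // 2015-01-01T00:00:00Z

// valueOf interprets the string as *local* midnight, so getTime is 2015-01-01T08:00:00Z.
val viaValueOf = Date.valueOf("2015-01-01")

// The long constructor keeps the millis as-is; locally this instant is still 2014-12-31.
val viaConstructor = new Date(utcMidnight)

println(viaValueOf.getTime - utcMidnight)  // 28800000, the 8-hour offset
println(viaConstructor)                    // 2014-12-31
{code}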

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.






[jira] [Commented] (SPARK-11326) Split networking in standalone mode

2015-11-05 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14992455#comment-14992455
 ] 

Russell Alexander Spitzer commented on SPARK-11326:
---

I'm not sure I see where standalone mode has ever not been focused on reaching 
feature parity with the other cluster management tools. Even recently, 
https://issues.apache.org/jira/browse/SPARK-4751 added the dynamic scaling 
ability that was previously only available on YARN and Mesos. I've always seen 
standalone mode as the cluster manager for users who don't want to run a 
ZooKeeper-maintained cluster.

> Split networking in standalone mode
> ---
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication, scheduler 
> internal communication and communication between the client and scheduler. 
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing Kerberos based SASL authentication/encryption for connections 
> between a client (Client or AppClient) and Spark Master, it will be possible 
> to introduce authentication and automatic generation of digest tokens and 
> safe sharing them among the application processes. 
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass secret, it is better (safer) to 
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> _ExecutorBackend_ afterwards and used for all the connections (with driver, 
> other executors and external shuffle service).
> 
> h5.Cluster mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is run by _DriverRunner_ which is is a part of the worker
> - The client will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The client will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master and submit the 
> driver
> - _DriverRunner_ will not find any secret in _DriverDescription_ so it will 
> look for it in the worker configuration and it will find it there (its 
> 

[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981968#comment-14981968
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

Some tests are now broken, investigating. 

The fix in SPARK-6785 seems off to me.

With it, 1 second before epoch and 1 second after epoch are 1 day apart. This 
should not be true: they should both be equivalently far (in days) from epoch 0.
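
For reference, the arithmetic in question with the offset term dropped (i.e. running in UTC):

{code}
// With the Math.floor version, instants two seconds apart straddle the day boundary:
//   floor( 1000.0 / 86400000) =  0  -> 1970-01-01
//   floor(-1000.0 / 86400000) = -1  -> 1969-12-31
val MILLIS_PER_DAY = 86400000L
println(math.floor( 1000.0 / MILLIS_PER_DAY).toInt)  //  0
println(math.floor(-1000.0 / MILLIS_PER_DAY).toInt)  // -1
{code}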

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.






[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981968#comment-14981968
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:20 AM:
-

Some tests are now broken, investigating. 

The fix in SPARK-6785 seems off to me.

With it, 1 second before epoch and 1 second after epoch are 1 day apart. This 
should not be true: they should both be equivalently far (in days) from epoch 0.

Actually I'm not sure about this now...


was (Author: rspitzer):
Some tests are now, broken investigating 

The fix in SPARK-6785 seems to be off to me

In it 1 second before epoch and 1 second after epoch are 1 Day apart. This 
should not be true. They should both be equivelently far (in days) from epoch 0

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.






[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982024#comment-14982024
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 7:14 AM:
-

I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even PDT).
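
For reference, a rough sketch of how a test can be pinned to several default time 
zones; the zone IDs and the loop body are placeholders, not the actual test from the PR:

{code}
import java.util.TimeZone

// Run a block with a forced default time zone, restoring the original afterwards.
def withTimeZone(id: String)(body: => Unit): Unit = {
  val original = TimeZone.getDefault
  TimeZone.setDefault(TimeZone.getTimeZone(id))
  try body finally TimeZone.setDefault(original)
}

Seq("UTC", "America/Los_Angeles", "Asia/Tokyo").foreach { tz =>
  withTimeZone(tz) {
    // the date round-trip assertions would go here
  }
}
{code}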


was (Author: rspitzer):
I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even UTC).

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982024#comment-14982024
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I added another commit to fix up the tests. The test code will now run 
identically no matter what time zone it happens to be run in (even UTC).

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982001#comment-14982001
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 6:46 AM:
-

I've been thinking about this for a while, and I think the underlying issue is 
that the conversion done before storing the value as an Int leads to a lot of 
strange behaviors. If we are going to have the Date type represent days from 
epoch, we should most likely throw out all information finer than day granularity.

Adding a test of
{code}checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the finer-grained information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
{code}[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68){code}



was (Author: rspitzer):
I've been thinking about this for a while, and I think the underlying issue is 
that the conversion done before storing the value as an Int leads to a lot of 
strange behaviors. If we are going to have the Date type represent days from 
epoch, we should most likely throw out all information finer than day granularity.

Adding a test of
{code}checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the finer-grained information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68)


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982001#comment-14982001
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I've been thinking about this for a while, and I think the underlying issue is 
that the conversion done before storing the value as an Int leads to a lot of 
strange behaviors. If we are going to have the Date type represent days from 
epoch, we should most likely throw out all information finer than day granularity.

Adding a test of
{code}checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the finer-grained information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68)


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982001#comment-14982001
 ] 

Russell Alexander Spitzer edited comment on SPARK-11415 at 10/30/15 9:44 AM:
-

I've been thinking about this for a while, and I think the underlying issue is 
that the conversion done before storing the value as an Int leads to a lot of 
strange behaviors. If we are going to have the Date type represent days from 
epoch, we should most likely throw out all information finer than day granularity.

Adding a test of
{code}checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the finer-grained information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
{code}[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68){code}

This is my basic problem with the integration: the operation of transforming a Date 
to and from a Catalyst Date only round-trips cleanly if the value is created in the 
local time zone. For all other timezones there is a possibility that the 
subtraction of the local timezone offset could cause the "floor" call to 
essentially move the entire date back in time a day. 
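
A sketch of that round trip going wrong, assuming the Spark 1.5 DateTimeUtils quoted 
in the description is on the classpath and the JVM default zone is behind UTC (set 
before these utilities are first touched, since the zone is cached per thread):

{code}
import java.sql.Date
import java.util.TimeZone
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Must happen before DateTimeUtils caches the default zone for this thread.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))

// 1970-01-01T00:00:00 UTC expressed as a java.sql.Date
val utcMidnight = new Date(0L)

val days = DateTimeUtils.millisToDays(utcMidnight.getTime)  // -1 rather than 0
val back = DateTimeUtils.toJavaDate(days)                   // prints as 1969-12-31
{code}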



was (Author: rspitzer):
I've been thinking about this for a while, and I think the underlying issue is 
that the conversion done before storing the value as an Int leads to a lot of 
strange behaviors. If we are going to have the Date type represent days from 
epoch, we should most likely throw out all information finer than day granularity.

Adding a test of
{code}checkFromToJavaDate(new Date(0)){code}
shows the trouble with trying to take the finer-grained information into account.

The date will be converted to some hours before epoch by the timezone magic (if 
you live in America) and then rounded down to -1. This means it fails the check 
because
{code}[info]   "19[69-12-3]1" did not equal "19[70-01-0]1" 
(DateTimeUtilsSuite.scala:68){code}


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982240#comment-14982240
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

I think the underlying conflict here is that {{java.sql.Date}} has no timezone 
information. This means it is up to the end user to properly align their 
{{Date}} object with UTC. This makes sense to me because otherwise we end up with 
a very difficult situation where only {{Date}} objects which match the locale 
of the executor will be correctly aligned. So for example if I try to create a 
{{Catalyst.DateType}} with a {{java.sql.Date}} that is in UTC but I'm running 
in PDT, the value will be aligned incorrectly (and also be returned as an 
incorrect {{java.sql.Date}}, since the time since epoch will be wrong). This 
requires an outside source (or user) to reformat their {{Date}}s depending on the 
location of the Spark cluster (or the configuration of its locale) if they don't 
want their dates to be corrupted. 

In C* the driver has a {{LocalDate}} class to avoid this problem, so it is 
extremely clear when a specific (year, month, day) tuple should be aligned 
against UTC. 

It also may be helpful to allow for a direct translation of {{Int}} -> 
{{Catalyst.DateType}} -> {{Int}} for those sources that can provide days from 
epoch directly.
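
As a small illustration of that ambiguity (plain JDK classes only; nothing here is 
Spark- or connector-specific):

{code}
import java.sql.Date
import java.util.{Calendar, TimeZone}

// Midnight of a given calendar day, aligned to the given time zone.
def midnightMillis(year: Int, month: Int, day: Int, tz: String): Long = {
  val cal = Calendar.getInstance(TimeZone.getTimeZone(tz))
  cal.clear()
  cal.set(year, month - 1, day)   // Calendar months are 0-based
  cal.getTimeInMillis
}

val utcAligned   = new Date(midnightMillis(2015, 10, 30, "UTC"))
val localAligned = new Date(midnightMillis(2015, 10, 30, "America/Los_Angeles"))
// Different underlying millis, both "meaning" 2015-10-30; only the one aligned
// with the executor's default zone survives the Catalyst round trip intact.
{code}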

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-29 Thread Russell Alexander Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Alexander Spitzer updated SPARK-11415:
--
Affects Version/s: 1.5.1

> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-29 Thread Russell Alexander Spitzer (JIRA)
Russell Alexander Spitzer created SPARK-11415:
-

 Summary: Catalyst DateType Shifts Input Data by Local Timezone
 Key: SPARK-11415
 URL: https://issues.apache.org/jira/browse/SPARK-11415
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Russell Alexander Spitzer


I've been running type tests for the Spark Cassandra Connector and couldn't get 
a consistent result for java.sql.Date. I investigated and noticed the following 
code is used to create Catalyst.DateTypes

https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
{code}
 /**
   * Returns the number of days since epoch from from java.sql.Date.
   */
  def fromJavaDate(date: Date): SQLDate = {
millisToDays(date.getTime)
  }
{code}

But millisToDays does not abide by this contract, shifting the underlying 
timestamp to the local timezone before calculating the days from epoch. This 
causes the invocation to move the actual date around.

{code}
  // we should use the exact day as Int, for example, (year, month, day) -> day
  def millisToDays(millisUtc: Long): SQLDate = {
// SPARK-6785: use Math.floor so negative number of days (dates before 1970)
// will correctly work as input for function toJavaDate(Int)
val millisLocal = millisUtc + 
threadLocalLocalTimeZone.get().getOffset(millisUtc)
Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
  }
{code}

The inverse function also incorrectly shifts the timezone
{code}
  // reverse of millisToDays
  def daysToMillis(days: SQLDate): Long = {
val millisUtc = days.toLong * MILLIS_PER_DAY
millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
  }

{code}
https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93

This will cause 1-off errors and could cause significant shifts in data if the 
underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11415) Catalyst DateType Shifts Input Data by Local Timezone

2015-10-29 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14981868#comment-14981868
 ] 

Russell Alexander Spitzer commented on SPARK-11415:
---

https://github.com/apache/spark/pull/9369/files PR Submitted for fix


> Catalyst DateType Shifts Input Data by Local Timezone
> -
>
> Key: SPARK-11415
> URL: https://issues.apache.org/jira/browse/SPARK-11415
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Russell Alexander Spitzer
>
> I've been running type tests for the Spark Cassandra Connector and couldn't 
> get a consistent result for java.sql.Date. I investigated and noticed the 
> following code is used to create Catalyst.DateTypes
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L139-L144
> {code}
>  /**
>* Returns the number of days since epoch from from java.sql.Date.
>*/
>   def fromJavaDate(date: Date): SQLDate = {
> millisToDays(date.getTime)
>   }
> {code}
> But millisToDays does not abide by this contract, shifting the underlying 
> timestamp to the local timezone before calculating the days from epoch. This 
> causes the invocation to move the actual date around.
> {code}
>   // we should use the exact day as Int, for example, (year, month, day) -> 
> day
>   def millisToDays(millisUtc: Long): SQLDate = {
> // SPARK-6785: use Math.floor so negative number of days (dates before 
> 1970)
> // will correctly work as input for function toJavaDate(Int)
> val millisLocal = millisUtc + 
> threadLocalLocalTimeZone.get().getOffset(millisUtc)
> Math.floor(millisLocal.toDouble / MILLIS_PER_DAY).toInt
>   }
> {code}
> The inverse function also incorrectly shifts the timezone
> {code}
>   // reverse of millisToDays
>   def daysToMillis(days: SQLDate): Long = {
> val millisUtc = days.toLong * MILLIS_PER_DAY
> millisUtc - threadLocalLocalTimeZone.get().getOffset(millisUtc)
>   }
> {code}
> https://github.com/apache/spark/blob/bb3b3627ac3fcd18be7fb07b6d0ba5eae0342fc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L81-L93
> This will cause 1-off errors and could cause significant shifts in data if 
> the underlying data is worked on in different timezones than UTC.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10978) Allow PrunedFilterScan to eliminate predicates from further evaluation

2015-10-07 Thread Russell Alexander Spitzer (JIRA)
Russell Alexander Spitzer created SPARK-10978:
-

 Summary: Allow PrunedFilterScan to eliminate predicates from 
further evaluation
 Key: SPARK-10978
 URL: https://issues.apache.org/jira/browse/SPARK-10978
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.5.0, 1.4.0, 1.3.0
Reporter: Russell Alexander Spitzer
 Fix For: 1.6.0


Currently PrunedFilterScan allows implementors to push down predicates to an 
underlying datasource. This is done solely as an optimization as the predicate 
will be reapplied on the Spark side as well. This allows for bloom-filter like 
operations but ends up doing a redundant scan for those sources which can do 
accurate pushdowns.

In addition it makes it difficult for underlying sources to accept queries 
which reference non-existent columns in order to provide ancillary functionality. 
In our case we allow a Solr query to be passed in via a non-existent solr_query 
column. Since this column is not returned, nothing passes when Spark re-applies 
the filter on "solr_query". 

Suggestion on the ML from [~marmbrus] 
{quote}
We have to try and maintain binary compatibility here, so probably the easiest 
thing to do here would be to add a method to the class.  Perhaps something like:

def unhandledFilters(filters: Array[Filter]): Array[Filter] = filters

By default, this could return all filters so behavior would remain the same, 
but specific implementations could override it.  There is still a chance that 
this would conflict with existing methods, but hopefully that would not be a 
problem in practice.
{quote}
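
A rough sketch of how a source might use such a hook, assuming it ends up on 
BaseRelation roughly as quoted above; the EqualTo test is just a placeholder for 
whatever predicates the source can actually evaluate exactly:

{code}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter}

trait ExactPushdownRelation extends BaseRelation {
  // Filters this source evaluates exactly; Spark would not need to re-apply them.
  private def handledExactly(f: Filter): Boolean = f.isInstanceOf[EqualTo]

  def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot(handledExactly)
}
{code}

Returning the input array unchanged would keep today's behavior of re-evaluating 
every predicate on the Spark side.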



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming

2015-09-30 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937471#comment-14937471
 ] 

Russell Alexander Spitzer commented on SPARK-9472:
--

Yeah, that's the workaround we recommend now, but it requires every 
application to manually specify the files. We just want our distribution to not 
require that much manual intervention (normally we automatically pass in the 
required Hadoop conf, based on the user's security setup, via spark.hadoop).
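
For reference, a sketch of the kind of manual intervention being described, assuming 
the getOrCreate overload that takes a Hadoop Configuration; the config keys and 
checkpoint path are placeholders:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/user/app/checkpoint"
val sparkConf = new SparkConf().set("spark.hadoop.fs.defaultFS", "hdfs://nn:8020")

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sparkConf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  ssc
}

// Without the fix, the spark.hadoop.* settings are not applied when recovering
// from the checkpoint, so the Hadoop config has to be repeated explicitly here.
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://nn:8020")

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _, hadoopConf)
{code}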

> Consistent hadoop config for streaming
> --
>
> Key: SPARK-9472
> URL: https://issues.apache.org/jira/browse/SPARK-9472
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9472) Consistent hadoop config for streaming

2015-09-29 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14936293#comment-14936293
 ] 

Russell Alexander Spitzer commented on SPARK-9472:
--

Any thoughts on back-porting this to previous release lines? We just hit this trying to 
pass in Hadoop variables via "spark.hadoop" and having them not register with 
the StreamingContext via getOrCreate.

> Consistent hadoop config for streaming
> --
>
> Key: SPARK-9472
> URL: https://issues.apache.org/jira/browse/SPARK-9472
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Cody Koeninger
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison

2015-06-05 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14574983#comment-14574983
 ] 

Russell Alexander Spitzer commented on SPARK-6987:
--

Or being able to specify an identifier for each Spark worker that isn't 
dependent on IP?

 Node Locality is determined with String Matching instead of Inet Comparison
 ---

 Key: SPARK-6987
 URL: https://issues.apache.org/jira/browse/SPARK-6987
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Russell Alexander Spitzer

 When determining whether or not a task can be run NodeLocal the 
 TaskSetManager ends up using a direct string comparison between the 
 preferredIp and the executor's bound interface.
 https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880
 https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
 This means that the preferredIp must be a direct string match of the IP the 
 worker is bound to. This means that APIs which are gathering data from 
 other distributed sources must develop their own mapping between the 
 interfaces bound (or exposed) by the external sources and the interface bound 
 by the Spark executor since these may be different. 
 For example, Cassandra exposes a broadcast rpc address which doesn't have to 
 match the address which the service is bound to. This means when adding 
 preferredLocation data we must add both the rpc and the listen address to 
 ensure that we can get a string match (and of course we are out of luck if 
 Spark has been bound on to another interface). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison

2015-06-04 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14572218#comment-14572218
 ] 

Russell Alexander Spitzer commented on SPARK-6987:
--

For Inet comparison I think the best thing you can do is compare resolved 
hostnames, but even that isn't really great. I've been looking into other 
solutions but haven't found anything really satisfactory.

Having each worker/executor list all of its interfaces could be useful; that way any 
service, as long as it is bound to a real interface on the machine, could be 
properly matched.
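
A tiny sketch of what resolved-address comparison might look like versus the plain 
string match (purely illustrative, not proposed Spark code):

{code}
import java.net.{InetAddress, UnknownHostException}

// Compare two host identifiers by their resolved addresses, falling back to a
// plain string comparison when resolution fails.
def sameHost(a: String, b: String): Boolean =
  try {
    InetAddress.getByName(a).getHostAddress == InetAddress.getByName(b).getHostAddress
  } catch {
    case _: UnknownHostException => a == b
  }

sameHost("127.0.0.1", "localhost")  // typically true after resolution, false as strings
{code}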

 Node Locality is determined with String Matching instead of Inet Comparison
 ---

 Key: SPARK-6987
 URL: https://issues.apache.org/jira/browse/SPARK-6987
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.2.0, 1.3.0
Reporter: Russell Alexander Spitzer

 When determining whether or not a task can be run NodeLocal the 
 TaskSetManager ends up using a direct string comparison between the 
 preferredIp and the executor's bound interface.
 https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880
 https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
 This means that the preferredIp must be a direct string match of the IP the 
 worker is bound to. This means that APIs which are gathering data from 
 other distributed sources must develop their own mapping between the 
 interfaces bound (or exposed) by the external sources and the interface bound 
 by the Spark executor since these may be different. 
 For example, Cassandra exposes a broadcast rpc address which doesn't have to 
 match the address which the service is bound to. This means when adding 
 preferredLocation data we must add both the rpc and the listen address to 
 ensure that we can get a string match (and of course we are out of luck if 
 Spark has been bound on to another interface). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-05-01 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524372#comment-14524372
 ] 

Russell Alexander Spitzer commented on SPARK-6069:
--

Running with --conf spark.files.userClassPathFirst=true yields a different error

{code}
scala> cc.sql("SELECT * FROM test.fun as a JOIN test.fun as b ON (a.k = b.v)").collect
15/05/01 17:24:34 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.0.2.15): java.lang.NoClassDefFoundError: org/apache/spark/Partition
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader$userClassLoader$.findClass(ExecutorURLClassLoader.scala:42)
at 
org.apache.spark.executor.ChildExecutorURLClassLoader.findClass(ExecutorURLClassLoader.scala:50)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at 
org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at 
org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
at 
org.apache.spark.repl.ExecutorClassLoader$$anonfun$findClass$1.apply(ExecutorClassLoader.scala:57)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:57)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Partition
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at 

[jira] [Commented] (SPARK-6069) Deserialization Error ClassNotFoundException with Kryo, Guava 14

2015-05-01 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524363#comment-14524363
 ] 

Russell Alexander Spitzer commented on SPARK-6069:
--

We've seen the same issue while developing the Spark Cassandra Connector. 
Unless the connector lib is loaded via spark.executor.extraClassPath, 
Kryo serialization for joins always fails with a ClassNotFoundException, even though 
all operations which don't require a shuffle are fine. 

{code}
com.esotericsoftware.kryo.KryoException: Unable to find class: 
org.apache.spark.sql.cassandra.CassandraSQLRow
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultClassResolver.java:138)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:115)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732)
at 
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:144)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.joins.HashedRelation$.apply(HashedRelation.scala:80)
at 
org.apache.spark.sql.execution.joins.ShuffledHashJoin$$anonfun$execute$1.apply(ShuffledHashJoin.scala:46)
at 
org.apache.spark.sql.execution.joins.ShuffledHashJoin$$anonfun$execute$1.apply(ShuffledHashJoin.scala:45)
at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

Adding the jar to spark.executor.extraClassPath rather than --jars solves the issue.
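
As a sketch, that workaround expressed as a plain SparkConf setting (the jar path is a 
placeholder and must already exist on every worker):

{code}
import org.apache.spark.SparkConf

// spark.executor.extraClassPath puts the jar on the executor JVM's launch classpath,
// whereas --jars only adds it to a runtime URL class loader.
val conf = new SparkConf()
  .set("spark.executor.extraClassPath", "/opt/connector/spark-cassandra-connector.jar")
{code}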

 Deserialization Error ClassNotFoundException with Kryo, Guava 14
 

 Key: SPARK-6069
 URL: https://issues.apache.org/jira/browse/SPARK-6069
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Standalone one worker cluster on localhost, or any 
 cluster
Reporter: Pat Ferrel
Priority: Critical

 A class is contained in the jars passed in when creating a context. It is 
 registered with kryo. The class (Guava HashBiMap) is created correctly from 
 an RDD and broadcast but the deserialization fails with ClassNotFound.
 The work around is to hard code the path to the jar and make it available on 
 all workers. Hard code because we are creating a library so there is no easy 
 way to pass in to the app something like:
 spark.executor.extraClassPath  /path/to/some.jar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL

2015-04-22 Thread Russell Alexander Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Alexander Spitzer updated SPARK-7061:
-
Description: 
Running the following code in the Spark shell against a standalone master.

{code}
case class CustomerID( id:Int)
sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1)
{code}

Gives the following exception

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

I believe this is related to the shuffle code since the following other 
examples also give this exception.

{code}
val idsOfInterest = sc.parallelize(1 to 
1000).map(CustomerID(_)).groupBy(_.id).take(1)
val idsOfInterest = sc.parallelize(1 to 1000).map( x => 
(CustomerID(_),x)).groupByKey().take(1)
val idsOfInterest = sc.parallelize(1 to 1000).map( x => 

[jira] [Created] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL

2015-04-22 Thread Russell Alexander Spitzer (JIRA)
Russell Alexander Spitzer created SPARK-7061:


 Summary: Case Classes Cannot be Repartitioned/Shuffled in Spark 
REPL
 Key: SPARK-7061
 URL: https://issues.apache.org/jira/browse/SPARK-7061
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.1
 Environment: Single Node Stand Alone Spark Shell
Reporter: Russell Alexander Spitzer
Priority: Minor


Running the following code in the Spark shell against a standalone master.

{code}
case class CustomerID( id:Int)
sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1)
{code}

Gives the following exception

{code}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:274)
at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
at 
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at 
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

I believe this is related to the shuffle code since the following other 

[jira] [Commented] (SPARK-7061) Case Classes Cannot be Repartitioned/Shuffled in Spark REPL

2015-04-22 Thread Russell Alexander Spitzer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14508170#comment-14508170
 ] 

Russell Alexander Spitzer commented on SPARK-7061:
--

Thanks! I didn't see it come up in a general text search, only the previous one 
I linked. But that is the issue (6299)

 Case Classes Cannot be Repartitioned/Shuffled in Spark REPL
 ---

 Key: SPARK-7061
 URL: https://issues.apache.org/jira/browse/SPARK-7061
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 1.2.1
 Environment: Single Node Stand Alone Spark Shell
Reporter: Russell Alexander Spitzer
Priority: Minor

 Running the following code in the Spark shell against a standalone master.
 {code}
 case class CustomerID( id:Int)
 sc.parallelize(1 to 1000).map(CustomerID(_)).repartition(1).take(1)
 {code}
 This gives the following exception:
 {code}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
 (TID 5, 10.0.2.15): java.lang.ClassNotFoundException: $iwC$$iwC$CustomerID
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
   at java.lang.Class.forName0(Native Method)
   at java.lang.Class.forName(Class.java:274)
   at 
 org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
   at 
 java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
   at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at 
 java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
   at 
 java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
   at 
 org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
   at 
 org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
   at 
 scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
   at 
 scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
   at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
   at org.apache.spark.rdd.RDD$$anonfun$27.apply(RDD.scala:1098)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at 
 org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1353)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:56)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
   

[jira] [Created] (SPARK-6987) Node Locality is determined with String Matching instead of Inet Comparison

2015-04-17 Thread Russell Alexander Spitzer (JIRA)
Russell Alexander Spitzer created SPARK-6987:


 Summary: Node Locality is determined with String Matching instead 
of Inet Comparison
 Key: SPARK-6987
 URL: https://issues.apache.org/jira/browse/SPARK-6987
 Project: Spark
  Issue Type: Bug
  Components: Scheduler, Spark Core
Affects Versions: 1.3.0, 1.2.0
Reporter: Russell Alexander Spitzer


When determining whether or not a task can be run NodeLocal, the TaskSetManager 
ends up using a direct string comparison between the preferredIp and the 
executor's bound interface.

https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L878-L880

https://github.com/apache/spark/blob/c84d91692aa25c01882bcc3f9fd5de3cfa786195/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L488-L490
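
As a minimal illustration of the difference (a sketch with hypothetical 
addresses, not the scheduler code itself): two strings that name the same node 
fail a plain equality check even though an InetAddress-based comparison would 
treat them as equal.

{code}
import java.net.InetAddress

// Hypothetical values: the address a data source reports as preferred, and
// the host string the executor registered with. Assume the hostname resolves
// to 10.0.0.5 in this example.
val preferredIp  = "10.0.0.5"
val executorHost = "cassandra-node-1"

// Effectively what the string comparison described above does:
val stringMatch = preferredIp == executorHost            // false

// An Inet-aware check (sketch only; DNS lookups and their cost are ignored):
val inetMatch =
  InetAddress.getByName(preferredIp) == InetAddress.getByName(executorHost)
{code}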

This means that the preferredIp must be an exact string match of the IP the 
worker is bound to. APIs that gather data from other distributed sources must 
therefore develop their own mapping between the interfaces bound (or exposed) 
by the external sources and the interface bound by the Spark executor, since 
these may be different. 

For example, Cassandra exposes a broadcast rpc address which doesn't have to 
match the address the service is actually bound to. This means that when adding 
preferredLocation data we must add both the rpc and the listen address to 
ensure we can get a string match (and of course we are out of luck if Spark 
has been bound to yet another interface). 
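
A sketch of the kind of workaround described above (the helper and its shape 
are hypothetical, not the connector's actual API): emit every candidate string 
for a replica, both the rpc and the listen address, so at least one has a 
chance to string-match whatever host the executor registered with.

{code}
// Sketch only: per-partition replica info, where each replica is known by
// both a broadcast rpc address and a listen address (names are hypothetical).
def preferredLocationsFor(replicas: Seq[(String, String)]): Seq[String] =
  replicas.flatMap { case (rpcAddress, listenAddress) =>
    Seq(rpcAddress, listenAddress)
  }.distinct

// preferredLocationsFor(Seq(("10.0.0.5", "192.168.1.5")))
//   == Seq("10.0.0.5", "192.168.1.5")
{code}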



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org