Is there a way to tell if a receiver is a Reliable Receiver?
I can't seem to find anything that would let a user know whether the receiver they are using is reliable or not. Even better would be a list of known reliable receivers. Are either of these things possible, or do you just have to research your receiver beforehand?
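For what it's worth, the rule of thumb from the custom receiver documentation is that a receiver is reliable only if it acknowledges the source after a blocking, multi-record store() call, so one crude check is to look at how the receiver you are using calls store(). A minimal sketch of the reliable pattern (FakeSource is a hypothetical stand-in for whatever system you consume from):

import scala.collection.mutable.ArrayBuffer
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical upstream client with an explicit acknowledgement API.
class FakeSource { def poll(): Seq[String] = Seq("a", "b"); def ack(): Unit = () }

class MyReliableReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  override def onStart(): Unit = {
    new Thread("poller") {
      override def run(): Unit = {
        val source = new FakeSource
        while (!isStopped()) {
          val batch = ArrayBuffer(source.poll(): _*)
          // The blocking, multi-record store() is what makes the receiver reliable:
          // it returns only once the records are safely stored inside Spark, so it is
          // safe to acknowledge the source afterwards.
          store(batch)
          source.ack()
        }
      }
    }.start()
  }
  override def onStop(): Unit = {}
}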
Avro/Parquet GenericFixed decimal is not read into Spark correctly
All, Before creating a JIRA for this I wanted to get a sense as to whether it would be shot down or not. Take the following code:

spark-shell --packages org.apache.avro:avro:1.8.1

import org.apache.avro.{Conversions, LogicalTypes, Schema}
import java.math.BigDecimal

val dc = new Conversions.DecimalConversion()
val javaBD = BigDecimal.valueOf(643.85924958)
val schema = Schema.parse("{\"type\":\"record\",\"name\":\"Header\",\"namespace\":\"org.apache.avro.file\",\"fields\":[" +
  "{\"name\":\"COLUMN\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"COLUMN\"," +
  "\"size\":19,\"precision\":17,\"scale\":8,\"logicalType\":\"decimal\"}]}]}")
val schemaDec = schema.getField("COLUMN").schema()
val fieldSchema = if (schemaDec.getType() == Schema.Type.UNION) schemaDec.getTypes.get(1) else schemaDec
val converted = dc.toFixed(javaBD, fieldSchema, LogicalTypes.decimal(javaBD.precision, javaBD.scale))
sqlContext.createDataFrame(List(("value", converted)))

and you'll get this error:

java.lang.UnsupportedOperationException: Schema for type org.apache.avro.generic.GenericFixed is not supported

However, if you write out a Parquet file using the AvroParquetWriter and the above GenericFixed value (converted), then read it back in via the DataFrameReader, the decimal value that is retrieved is not accurate (i.e. the 643... value above comes back as -0.5...). Even if the type is not supported, is there any way to at least have the read path throw an UnsupportedOperationException, as it does when you try to use the value directly (as opposed to reading it in from a file)?
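A possible workaround until this is handled (a sketch, reusing dc, converted, fieldSchema and javaBD from the snippet above): convert the GenericFixed back into a java.math.BigDecimal with the same DecimalConversion before the value ever reaches Spark, since Spark does know how to map BigDecimal to a DecimalType.

val logicalType = LogicalTypes.decimal(javaBD.precision, javaBD.scale)

// Round-trip the fixed bytes back into a BigDecimal on the way out of Avro...
val restored: java.math.BigDecimal = dc.fromFixed(converted, fieldSchema, logicalType)

// ...and hand Spark the BigDecimal instead of the GenericFixed, which it supports directly.
val df = sqlContext.createDataFrame(List(("value", restored)))
df.show()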
SparkStreaming getActiveOrCreate
The docs on getActiveOrCreate make it seem that you'll get an already started context:

> Either return the "active" StreamingContext (that is, started but not stopped), or create a new StreamingContext that is...

However, as far as I can tell from the code, it is strictly dependent on the implementation of the create function, and per the tests it is even expected that the create function will return a non-started context. So this makes for a bit of awkward code that must check the state before deciding whether to start it. Was this considered when this was added? I couldn't find anything explicit.

-Justin Pihony
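For reference, the awkwardness I mean looks roughly like this (a sketch, assuming sc from the shell; getState has been on StreamingContext since 1.4):

import org.apache.spark.streaming.{Seconds, StreamingContext, StreamingContextState}

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  // ...set up DStreams, checkpointing, etc....
  ssc
}

val ssc = StreamingContext.getActiveOrCreate(createContext _)

// getActiveOrCreate may hand back either an already ACTIVE context or a freshly
// created one, so the caller still has to check before starting it.
if (ssc.getState() != StreamingContextState.ACTIVE) {
  ssc.start()
}
ssc.awaitTermination()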
Re: Jar not in shell classpath in Windows 10
I've verified this is that issue, so please disregard.

On Wed, Mar 1, 2017 at 1:07 AM, Justin Pihony <justin.pih...@gmail.com> wrote:
> As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at it deeper now.
Re: Jar not in shell classpath in Windows 10
As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at it deeper now.

On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony <justin.pih...@gmail.com> wrote:
> Run spark-shell --packages datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to import anything under com.datastax. I have checked that the jar is listed among the classpaths, and it is, albeit behind a spark URL. I'm wondering if added jars fail in Windows due to this server abstraction? I can't find anything on this already, so maybe it's somehow an environment thing. Has anybody encountered this?
Jar not in shell classpath in Windows 10
Run spark-shell --packages datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to import anything under com.datastax. I have checked that the jar is listed among the classpaths, and it is, albeit behind a spark URL. I'm wondering if added jars fail in Windows due to this server abstraction? I can't find anything on this already, so maybe it's somehow an environment thing. Has anybody encountered this?
Is there a list of missing optimizations for typed functions?
I was curious whether there was introspection of certain typed functions, and ran the following two queries:

ds.where($"col" > 1).explain
ds.filter(_.col > 1).explain

I found that the typed function does NOT result in a pushed filter. I imagine this is due to the limited view the optimizer has into the function, so I have two questions really:

1.) Is there a list of the methods that lose some of the optimizations that you get from the non-functional methods? Is it any method that accepts a generic function?
2.) Is there any work to attempt reflection and gain some of these optimizations back? I couldn't find anything in JIRA.

Thanks,
Justin Pihony
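A concrete way to see the difference in the shell, as a sketch (assuming Spark 2.x with spark.implicits._ already imported, and using a throwaway Parquet file at /tmp/nums so there is a real source to push filters into):

case class Num(col: Int)

// Write a small Parquet file to act as the data source.
spark.range(100).selectExpr("cast(id as int) as col").write.mode("overwrite").parquet("/tmp/nums")

val ds = spark.read.parquet("/tmp/nums").as[Num]

// Column-based predicate: the FileScan node in the physical plan should list a PushedFilters entry.
ds.where($"col" > 1).explain()

// Typed lambda predicate: the filter runs on deserialized objects, so nothing is pushed down.
ds.filter(_.col > 1).explain()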
K-Means seems biased to one center
(Cross post with http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center) I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tf-idf and then through k-means. After a couple of issues to start (you can see my previous questions), I'm finally getting a KMeansModel...but when I try to predict, I keep getting the same center. Is this because of the small dataset, or because I'm comparing multi-word documents against a much shorter (1-20 word) query? Or is there something else I'm doing wrong? See the code below:

//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]

val hashingTF = new HashingTF(10)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache

val kMeansModel = KMeans.train(tfidf, 3, 10)

val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
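One thing that jumps out (a guess at the cause, not a certain diagnosis): HashingTF(10) hashes every word in the corpus into only 10 buckets, so unrelated terms collide constantly and every document ends up looking similar. A sketch of the same pipeline with a more realistic feature dimension and one cluster per topic (the 1 << 18 and k = 4 values are just illustrative choices):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.rdd.RDD

def cluster(corpus: RDD[Seq[String]]): Int = {
  val hashingTF = new HashingTF(1 << 18)   // ~262k buckets instead of 10, far fewer collisions
  val tf = hashingTF.transform(corpus)
  val idf = new IDF().fit(tf)
  val tfidf = idf.transform(tf).cache()

  val model = KMeans.train(tfidf, 4, 20)   // 4 clusters for 4 topics, more iterations

  // Apply the same TF and IDF transforms to the query before predicting.
  val query = idf.transform(hashingTF.transform(Seq("music")))
  model.predict(query)
}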
Re: Is MLBase dead?
To take a stab at my own answer: MLBase is now fully integrated into MLlib. MLI/MLlib became the mllib algorithms, and MLO the ml pipelines?

On Mon, Sep 28, 2015 at 10:19 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and org.apache.spark.ml? I cannot find anything official, and the last updates seem to be a year or two old.
Is MLBase dead?
As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and org.apache.spark.ml? I cannot find anything official, and the last updates seem to be a year or two old.
Re: How to access Spark UI through AWS
I figured it all out after this: http://apache-spark-user-list.1001560.n3.nabble.com/WebUI-on-yarn-through-ssh-tunnel-affected-by-AmIpfilter-td21540.html

The short of it is that I needed to set SPARK_PUBLIC_DNS (not DNS_HOME) = ec2_publicdns. Then the YARN proxy gets in the way, so I needed to go to http://ec2_publicdns:20888/proxy/applicationid/jobs (9046 is the older EMR port), or, as Jonathan said, the Spark history server works once a job is completed.

On Tue, Aug 25, 2015 at 5:26 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> OK, I figured out the horrid look also...the href of all of the styles is prefixed with the proxy data...so, ultimately, if I can fix the proxy issues with the links, then I can fix the look also.
Re: How to access Spark UI through AWS
Thanks. I just tried and still am having trouble. It seems to still be using the private address even if I try going through the resource manager.

On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan <jonat...@amazon.com> wrote:
> I'm not sure why the UI appears broken like that either and haven't investigated it myself yet, but if you instead go to the YARN ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x, I believe), then you should be able to click on the ApplicationMaster link (or the History link for completed applications) to get to the Spark UI from there. The ApplicationMaster link will use the YARN Proxy Service (port 20888 on emr-4.x; not sure about 3.x) to proxy through to the Spark application's UI, regardless of what port it's running on. For completed applications, the History link will send you directly to the Spark History Server UI on port 18080. Hope that helps!
>
> ~ Jonathan
Re: How to access Spark UI through AWS
OK, I figured out the horrid look also...the href of all of the styles is prefixed with the proxy data...so, ultimately, if I can fix the proxy issues with the links, then I can fix the look also.

On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to access the Spark UI directly. The application proxy was still getting in the way by the way it creates the URL, so I manually filled in the /stage?id=#&attempt=# and that worked...I'm still having trouble with the css as the UI looks horrid...but I'll tackle that next :)
Re: How to access Spark UI through AWS
SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to access the Spark UI directly. The application proxy was still getting in the way by the way it creates the URL, so I manually filled in the /stage?id=#&attempt=# and that worked...I'm still having trouble with the css as the UI looks horrid...but I'll tackle that next :)

On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> Thanks. I just tried and still am having trouble. It seems to still be using the private address even if I try going through the resource manager.
Re: Got wrong md5sum for boto
Additional info...if I use an online md5sum check, then it matches...so it's either Windows or Python (using 2.7.10).

On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony <justin.pih...@gmail.com> wrote:
> When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two different machines. I am running on Windows, but I would imagine that shouldn't affect the md5. Is this a boto problem, a Python problem, or a Spark problem?
Re: Got wrong md5sum for boto
I found this solution: https://stackoverflow.com/questions/3390484/python-hashlib-md5-differs-between-linux-windows

Does anybody see a reason why I shouldn't put in a PR to make this change?

FROM:
with open(tgz_file_path) as tar:
TO:
with open(tgz_file_path, "rb") as tar:

On Mon, Aug 24, 2015 at 11:58 AM, Justin Pihony <justin.pih...@gmail.com> wrote:
> Additional info...if I use an online md5sum check, then it matches...so it's either Windows or Python (using 2.7.10).
Got wrong md5sum for boto
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two different machines. I am running on Windows, but I would imagine that shouldn't affect the md5. Is this a boto problem, a Python problem, or a Spark problem?
How to access Spark UI through AWS
I am using the steps from this article https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to get Spark up and running on EMR through YARN. Once up and running, I ssh in, cd to the Spark bin, and run spark-shell --master yarn. Once this spins up I can see that the UI is started at the internal IP on port 4040. If I hit the public DNS at 4040 with dynamic port tunneling and FoxyProxy, then I get a crude UI (css seems broken), however the proxy continuously redirects me to the main page, so I cannot drill into anything. So, I tried static tunneling, but can't seem to get through. So, how can I access the Spark UI when running a Spark shell in AWS YARN?
Re: Accumulators in Spark Streaming on UI
You need to make sure to name the accumulator.

On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote <nagmote.sne...@gmail.com> wrote:
> Hello all,
> I have an accumulator in a Spark Streaming application which counts the number of events received from Kafka. From the documentation, it seems the Spark UI has support to display it. But I am unable to see it in the UI. I am using Spark 1.3.1.
> Do I need to call any method (print), or am I missing something?
> Thanks in advance,
> Snehal
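To make that concrete (a sketch against the 1.x API; kafkaStream stands in for whatever DStream you already have): only named accumulators show up on the stage pages of the UI.

// The second argument is the display name; without it the accumulator never appears in the UI.
val eventCount = sc.accumulator(0L, "Kafka events received")

kafkaStream.foreachRDD { rdd =>
  rdd.foreach { _ => eventCount += 1L }
}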
Re: Why is RDD to PairRDDFunctions only via implicits?
The (crude) proof of concept seems to work:

class RDD[V](value: List[V]) {
  def doStuff = println("I'm doing stuff")
}

object RDD {
  implicit def toPair[V](x: RDD[V]) = new PairRDD(List((1, 2)))
}

class PairRDD[K, V](value: List[(K, V)]) extends RDD[(K, V)](value) {
  def doPairs = println("I'm using pairs")
}

class Context {
  def parallelize[K, V](x: List[(K, V)]) = new PairRDD(x)
  def parallelize[V](x: List[V]) = new RDD(x)
}

On Fri, May 22, 2015 at 2:44 PM, Reynold Xin <r...@databricks.com> wrote:
> I'm not sure if it is possible to overload the map function twice, once for just KV pairs, and another for K and V separately.
Why is RDD to PairRDDFunctions only via implicits?
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but it could be even more discoverable if made available via the API directly. I assume this was originally an omission that now needs to be kept for backwards compatibility, but would any of the repo owners be open to making this more discoverable to the point of API docs and tab completion (while keeping both binary and source compatibility)?

class PairRDD extends RDD {
  // pair methods
}

RDD {
  def map[K: ClassTag, V: ClassTag](f: T => (K, V)): PairRDD[K, V]
}

As long as the implicits remain, then compatibility remains, but now it is explicit in the docs on how to get a PairRDD and in tab completion. Thoughts?

Justin Pihony
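For context, this is what discoverability looks like today (a sketch against the current API): the conversion added by SPARK-4397 lives as an implicit on the RDD companion object, so no import is needed, but the pair methods still do not show up on RDD itself in scaladoc or tab completion.

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey lives on PairRDDFunctions; the compiler silently applies
// RDD.rddToPairRDDFunctions(pairs) to make this line compile.
val counts = pairs.reduceByKey(_ + _)
counts.collect().foreach(println)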
Re: Spark logo license
Thanks!

On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/
>
> Matei
Spark logo license
What is the license on using the Spark logo? Is it free to be used for display commercially?
Windows DOS bug in windows-utils.cmd
When running something like this:

spark-shell --jars foo.jar,bar.jar

this keeps failing to include the tail of the jars list. Digging into the launch scripts I found that the comma makes it so that the list was sent as separate parameters. So, to keep things together, I tried

spark-shell --jars "foo.jar, bar.jar"

But this still failed, as the quotes carried over into some of the string checks and resulted in invalid character errors. So, I am curious if anybody sees a problem with making a PR to fix the script from

...
if "x%2"=="x" (
  echo "%1" requires an argument. >&2
  exit /b 1
)
set SUBMISSION_OPTS=%SUBMISSION_OPTS% %1 %2
...

TO

...
if "x%~2"=="x" (
  echo "%1" requires an argument. >&2
  exit /b 1
)
set SUBMISSION_OPTS=%SUBMISSION_OPTS% %1 %~2
...

The only difference is the use of the tilde to remove any surrounding quotes if there are some. I figured I would ask here first to vet any unforeseen bugs this might cause on other systems. As far as I know this should be harmless and only make it so that comma-separated lists will work in DOS.
TwitterUtils on Windows
I am trying to print a basic Twitter stream and receiving the following error:

15/05/18 22:03:14 INFO Executor: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with timestamp 1432000973058
15/05/18 22:03:14 INFO Utils: Fetching http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar to C:\Users\Justin\AppData\Local\Temp\spark-4a37d3e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp
15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
        at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
        at org.apache.hadoop.util.Shell.run(Shell.java:455)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
        at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
        at org.apache.spark.util.Utils$.fetchFile(Utils.scala:443)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:374)
        at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:366)
        at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
        at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
        at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
        at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
        at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
        at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:366)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:744)

Code is:

spark-shell --jars \Spark\lib\spark-streaming-twitter_2.10-1.3.1.jar,\Spark\lib\twitter4j-async-3.0.3.jar,\Spark\lib\twitter4j-core-3.0.3.jar,\Spark\lib\twitter4j-media-support-3.0.3.jar,\Spark\lib\twitter4j-stream-3.0.3.jar

import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming._

System.setProperty("twitter4j.oauth.consumerKey", "*")
System.setProperty("twitter4j.oauth.consumerSecret", "*")
System.setProperty("twitter4j.oauth.accessToken", "*")
System.setProperty("twitter4j.oauth.accessTokenSecret", "*")

val ssc = new StreamingContext(sc, Seconds(10))
val stream = TwitterUtils.createStream(ssc, None)
stream.print
ssc.start

This seems to be happening at FileUtil.chmod(targetFile.getAbsolutePath, "a+x"), but I'm not sure why...
Re: TwitterUtils on Windows
I think I found the answer: http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html

Do I have no way of running this in Windows locally?

On Mon, May 18, 2015 at 10:44 PM, Justin Pihony <justin.pih...@gmail.com> wrote:
> I'm not 100% sure that is causing a problem, though. The stream still starts, but is giving blank output. I checked the environment variables in the UI and it is running local[*], so there should be no bottleneck there.
Re: TwitterUtils on Windows
I'm not 100% sure that is causing a problem, though. The stream still starts, but is giving blank output. I checked the environment variables in the UI and it is running local[*], so there should be no bottleneck there.
Trying to understand sc.textFile better
All, I am trying to understand the textFile method deeply, but I think my lack of deep Hadoop knowledge is holding me back here. Let me lay out my understanding and maybe you can correct anything that is incorrect.

When sc.textFile(path) is called, then defaultMinPartitions is used, which is really just math.min(taskScheduler.defaultParallelism, 2). Let's assume we are using the SparkDeploySchedulerBackend, where that is

conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2))

So now let's say the default is 2. Going back to textFile, this is passed in to HadoopRDD. The true size is determined in getPartitions() using inputFormat.getSplits(jobConf, minPartitions). But, from what I can find, the minPartitions parameter is merely a hint and is in fact mostly ignored, so you will probably get the total number of blocks.

OK, this fits with expectations. However, what if the default is not used and you provide a partition count larger than what the block size would yield? If my research is right and the getSplits call simply ignores this parameter, then wouldn't the provided minimum end up being ignored, and you would still just get partitions based on the block size?

Thanks,
Justin
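A quick way to poke at this empirically (a sketch; /tmp/big.txt is just a placeholder for any file larger than one block):

// Default: math.min(defaultParallelism, 2) is used as the minPartitions hint.
val defaultRdd = sc.textFile("/tmp/big.txt")
println(s"default partitions = ${defaultRdd.partitions.length}")

// Explicit hint: FileInputFormat.getSplits receives 64 as the desired number of splits.
// Whether you actually get 64 depends on the file size, block size, and minimum split size.
val hintedRdd = sc.textFile("/tmp/big.txt", 64)
println(s"hinted partitions = ${hintedRdd.partitions.length}")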
Did DataFrames break basic SQLContext?
I started to play with 1.3.0 and found that there are a lot of breaking changes. Previously, I could do the following:

case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
import sqlContext._
rdd.registerTempTable("foo")

Now I am not able to directly use my RDD object and have it implicitly become a DataFrame. It can be used as a DataFrameHolder, of which I could write:

rdd.toDF.registerTempTable("foo")

But that is kind of a pain in comparison. The other problem for me is that I keep getting a SQLException:

java.sql.SQLException: Failed to start database 'metastore_db' with class loader sun.misc.Launcher$AppClassLoader@10393e97, see the next exception for details.

This seems to be a dependency on Hive, when previously (1.2.0) there was no such dependency. I can open tickets for these, but wanted to ask here first...maybe I am doing something wrong?

Thanks,
Justin
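For anyone hitting the same thing, the least-painful migration I've found (a sketch in the 1.3 shell): the conversions now live under sqlContext.implicits._ rather than sqlContext._, and toDF becomes an explicit step.

case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))

// In 1.3 the RDD-to-DataFrame conversions moved under implicits:
import sqlContext.implicits._

// toDF is now an explicit (if slightly noisier) step before registering the table.
rdd.toDF().registerTempTable("foo")
sqlContext.sql("SELECT x FROM foo").show()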
Re: Did DataFrames break basic SQLContext?
It appears that the metastore_db problem is related to https://issues.apache.org/jira/browse/SPARK-4758. I had another shell open that was stuck. This is probably a bug, though?

import sqlContext.implicits._
case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
rdd.toDF

results in a frozen shell after this line:

INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: "@" (64), after : "".

which locks the internally created metastore_db.
Bug in Streaming files?
All, Looking into this StackOverflow question https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469 it appears that there is a bug when utilizing the newFilesOnly parameter in FileInputDStream. Before creating a ticket, I wanted to verify it here. The gist is that this code is wrong:

val modTimeIgnoreThreshold = math.max(
  initialModTimeIgnoreThreshold,   // initial threshold based on newFilesOnly setting
  currentTime - durationToRemember.milliseconds  // trailing end of the remember window
)

The problem is that if you set newFilesOnly to false, then initialModTimeIgnoreThreshold is always 0, which means it always drops out of the max operation. So the best you get is files that were put in the directory within (duration) of the start. Is this a bug or expected behavior? It seems like a bug to me. If I am correct, this appears to be a bigger fix than just using min, as that would break other functionality.
SparkSQL JSON array support
Are there any plans to support JSON arrays more fully? Take for example:

val myJson = sqlContext.jsonRDD(sc.parallelize(List("""{"foo":[{"bar":1},{"baz":2}]}""")))
myJson.registerTempTable("JsonTest")

I would like a way to pull out parts of the array data based on a key:

sql("SELECT foo[bar] FROM JsonTest") //projects only the object with bar, the rest would be null

I could even work around this if there was some way to access the key name from the SchemaRDD:

myJson.filter(x => x(0).asInstanceOf[Seq[Row]].exists(y => y.key == "bar"))
      .map(x => x(0).asInstanceOf[Seq[Row]].filter(y => y.key == "bar"))
//This does the same as above, except also filtering out those without a bar key

This is the closest suggestion I could find thus far, https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView, which still does not solve the problem of pulling out the keys. I tried with a UDF also, but could not currently make that work either. If there isn't anything in the works, then would it be appropriate to create a ticket for this?

Thanks,
Justin
Re: Spark SQL Static Analysis
Thanks!

On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust <mich...@databricks.com> wrote:
> It is somewhat out of date, but here is what we have so far: https://github.com/marmbrus/sql-typed
Spark SQL Static Analysis
I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis, however I cannot find the presentation now, nor can I find any documentation or research papers on the topic. So, I am curious if there is indeed any work going on in this area. The two things I would be interested in would be gaining compile-time safety, and gaining the ability to work on my data as a type instead of a Row (i.e., result.map(x => x.Age) instead of having to use Row.get).
SQLContext.applySchema strictness
Per the documentation:

> It is important to make sure that the structure of every Row of the provided RDD matches the provided schema. Otherwise, there will be a runtime exception.

However, it appears that this is not being enforced:

import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
val struct = StructType(List(StructField("test", BooleanType, true)))
val myData = sc.parallelize(List(Row(0), Row(true), Row("stuff")))
val schemaData = sqlContext.applySchema(myData, struct) //No error
schemaData.collect()(0).getBoolean(0) //Only now will I receive an error

Is this expected or a bug?

Thanks,
Justin
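If you want the mismatch to surface when the RDD is built rather than at first access, one option (a sketch, nothing built in; reuses myData, struct, and sqlContext from above) is to validate the rows yourself before calling applySchema:

// Force a pass over the data that checks each value against the declared type.
// This is a paid-for extra scan; applySchema itself stays lazy on purpose.
val validated = myData.map { row =>
  require(row(0) == null || row(0).isInstanceOf[Boolean],
    s"Expected Boolean for field 'test' but got: ${row(0)}")
  row
}
validated.count() // triggers the check across the whole RDD

val schemaData = sqlContext.applySchema(validated, struct)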
Re: SQLContext.applySchema strictness
OK, but what about on an action, like collect()? Shouldn't it be able to determine the correctness at that time?

On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai <yh...@databricks.com> wrote:
> Hi Justin,
> It is expected. We do not check if the provided schema matches the rows, since all rows would need to be scanned to give a correct answer.
> Thanks,
> Yin