Is there a way to tell if a receiver is a Reliable Receiver?

2017-04-17 Thread Justin Pihony
I can't seem to find anything that would let a user know whether the receiver
they are using is reliable or not. Even better would be a list of known reliable
receivers. Is either of those things possible, or do you just have to research
your receiver beforehand?
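
For context, the distinction only shows up in how the receiver is written, so a
minimal sketch of the pattern (assuming the
org.apache.spark.streaming.receiver.Receiver API and a hypothetical source that
supports acknowledgement) may help when reading a connector's code:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Hypothetical acknowledging source, purely for illustration.
    trait AckingSource { def poll(): ArrayBuffer[String]; def ack(): Unit }

    class ReliableSketchReceiver(source: AckingSource)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = new Thread("sketch-receiver") {
        override def run(): Unit = while (!isStopped()) {
          val batch = source.poll()
          store(batch)  // blocking, stores the whole batch at once -> reliable
          source.ack()  // acknowledge only after Spark has the data
        }
      }.start()

      def onStop(): Unit = ()
    }

An unreliable receiver, by contrast, typically pushes single records with the
one-argument store() and never acknowledges the source, so the quickest check
today seems to be reading the receiver's source for exactly this pattern.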






Avro/Parquet GenericFixed decimal is not read into Spark correctly

2017-04-12 Thread Justin Pihony
All,

Before creating a JIRA for this I wanted to get a sense as to whether it
would be shot down or not:

Take the following code:

spark-shell --packages org.apache.avro:avro:1.8.1
import org.apache.avro.{Conversions, LogicalTypes, Schema}
import java.math.BigDecimal
val dc = new Conversions.DecimalConversion()
val javaBD = BigDecimal.valueOf(643.85924958)
val schema = Schema.parse(
  "{\"type\":\"record\",\"name\":\"Header\",\"namespace\":\"org.apache.avro.file\",\"fields\":[" +
  "{\"name\":\"COLUMN\",\"type\":[\"null\",{\"type\":\"fixed\",\"name\":\"COLUMN\"," +
  "\"size\":19,\"precision\":17,\"scale\":8,\"logicalType\":\"decimal\"}]}]}")
val schemaDec = schema.getField("COLUMN").schema()
val fieldSchema =
  if (schemaDec.getType() == Schema.Type.UNION) schemaDec.getTypes.get(1) else schemaDec
val converted = dc.toFixed(javaBD, fieldSchema,
  LogicalTypes.decimal(javaBD.precision, javaBD.scale))
sqlContext.createDataFrame(List(("value",converted)))

and you'll get this error:

java.lang.UnsupportedOperationException: Schema for type
org.apache.avro.generic.GenericFixed is not supported

However, if you write out a Parquet file using the AvroParquetWriter with the
above GenericFixed value (converted) and then read it back in via the
DataFrameReader, the decimal value that is retrieved is not accurate (i.e. the
643... above comes back as -0.5...).

Even if the type is not supported, is there any way to at least have the read
throw an UnsupportedOperationException, as it does when you try to use the value
directly (as opposed to reading it in from a file)?
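
As a sanity check, here is a minimal sketch (continuing the snippet above, so
dc, javaBD, fieldSchema and converted are assumed) of decoding the fixed bytes
back by hand; if this prints the original value, the Avro encoding itself
round-trips and the mismatch is on the Spark read side:

    // Decode the GenericFixed back into a java.math.BigDecimal with the same conversion.
    val logicalType = LogicalTypes.decimal(javaBD.precision, javaBD.scale)
    val roundTripped = dc.fromFixed(converted, fieldSchema, logicalType)
    println(roundTripped)  // expected: 643.85924958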






SparkStreaming getActiveOrCreate

2017-03-18 Thread Justin Pihony
The docs on getActiveOrCreate make it seem that you'll get an already-started
context:

> Either return the "active" StreamingContext (that is, started but not
> stopped), or create a new StreamingContext that is

However, as far as I can tell from the code, it is strictly dependent on the
implementation of the create function, and per the tests it is even expected
that the create function will return a non-started context. So this makes for a
bit of awkward code that must check the state before deciding whether to start.
Was this considered when this was added? I couldn't find anything explicit.
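
For concreteness, this is the kind of state check I mean (a sketch;
createContext stands in for the usual creating function):

    import org.apache.spark.streaming.{StreamingContext, StreamingContextState}

    def createContext(): StreamingContext = ???  // build and wire up a new context here

    val ssc = StreamingContext.getActiveOrCreate(() => createContext())

    // getActiveOrCreate can hand back either a running context or a fresh,
    // unstarted one, so the caller still has to branch on the state:
    if (ssc.getState() != StreamingContextState.ACTIVE) {
      ssc.start()
    }
    ssc.awaitTermination()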

-Justin Pihony






Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
I've verified this is that issue, so please disregard.

On Wed, Mar 1, 2017 at 1:07 AM, Justin Pihony <justin.pih...@gmail.com>
wrote:

> As soon as I posted this I found
> https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue.
> I'm looking at it deeper now.
>
> On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony <justin.pih...@gmail.com>
> wrote:
>
>> Run spark-shell --packages
>> datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to do an
>> import of anything com.datastax. I have checked that the jar is listed
>> among
>> the classpaths and it is, albeit behind a spark URL. I'm wondering if
>> added
>> jars fail in windows due to this server abstraction? I can't find anything
>> on this already, so maybe it's somehow an environment thing. Has anybody
>> encountered this?
>>
>>
>>
>>
>


Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
As soon as I posted this I found
https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the
issue. I'm looking at it deeper now.

On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony <justin.pih...@gmail.com>
wrote:

> Run spark-shell --packages
> datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to do an
> import of anything com.datastax. I have checked that the jar is listed
> among
> the classpaths and it is, albeit behind a spark URL. I'm wondering if added
> jars fail in windows due to this server abstraction? I can't find anything
> on this already, so maybe it's somehow an environment thing. Has anybody
> encountered this?
>
>
>


Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
Run spark-shell --packages
datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to import
anything under com.datastax. I have checked that the jar is listed among the
classpaths, and it is, albeit behind a spark:// URL. I'm wondering if added jars
fail on Windows due to this server abstraction? I can't find anything on this
already, so maybe it's somehow an environment thing. Has anybody encountered
this?
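
A quick way to see what the shell has actually resolved and loaded (a sketch;
the probed class name is just an assumption for illustration):

    import scala.util.Try

    // What --packages resolved to; spark.jars holds the comma-separated list,
    // typically served from the driver's file server as spark:// URLs.
    println(sc.getConf.get("spark.jars", "<not set>"))

    // Can the REPL classloader actually see something from the connector jar?
    println(Try(Class.forName("com.datastax.spark.connector.package$")))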






Is there a list of missing optimizations for typed functions?

2017-02-22 Thread Justin Pihony
I was curious if there was introspection of certain typed functions and ran
the following two queries:

ds.where($"col" > 1).explain
ds.filter(_.col > 1).explain

And found that the typed function does NOT result in a pushed filter. I imagine
this is because the optimizer only has a limited view of the function, so I have
two questions really:

1.) Is there a list of the methods that lose some of the optimizations you get
from the Column-based (non-lambda) methods? Is it any method that accepts an
arbitrary function?
function?
2.) Is there any work to attempt reflection and gain some of these
optimizations back? I couldn't find anything in JIRA.
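
For reference, the comparison I ran looks roughly like this (a sketch, assuming
a Spark 2.x spark-shell where spark and its implicits are in scope, and a
Parquet file at a made-up path):

    import spark.implicits._

    case class Rec(col: Int)
    val ds = spark.read.parquet("/tmp/recs").as[Rec]

    // Column expression: Catalyst sees the predicate, and explain() shows a
    // PushedFilters entry for it on the file scan.
    ds.where($"col" > 1).explain()

    // Scala lambda: opaque to the optimizer, so no filter is pushed and every
    // row is deserialized into Rec before the predicate runs.
    ds.filter(_.col > 1).explain()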

Thanks,
Justin Pihony






K-Means seems biased to one center

2015-10-05 Thread Justin Pihony
(Cross post with
http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center)


I have a corpus of wiki pages (baseball, hockey, music, football) which I'm
running through TF-IDF and then through k-means. After a couple of issues to
start (you can see my previous questions), I'm finally getting a
KMeansModel...but when I try to predict, I keep getting the same center. Is this
because of the small dataset, or because I'm comparing multi-word documents
against a much shorter (1-20 word) query? Or is there something else I'm doing
wrong? See the code below:

//Preprocessing of data includes splitting into words
//and removing words with only 1 or 2 characters
val corpus: RDD[Seq[String]]
val hashingTF = new HashingTF(10)
val tf = hashingTF.transform(corpus)
val idf = new IDF().fit(tf)
val tfidf = idf.transform(tf).cache
val kMeansModel = KMeans.train(tfidf, 3, 10)

val queryTf = hashingTF.transform(List("music"))
val queryTfidf = idf.transform(queryTf)
kMeansModel.predict(queryTfidf) //Always the same, no matter the term supplied
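
One quick diagnostic (a sketch reusing tfidf and kMeansModel from above): look
at how the training vectors themselves are assigned. If nearly everything lands
in a single cluster, the centers are degenerate regardless of what the query
looks like.

    // Distribution of the training documents over the 3 clusters; a heavy skew
    // here means the model itself collapsed, not just the single-word queries.
    val counts = tfidf.map(v => kMeansModel.predict(v)).countByValue()
    counts.toSeq.sortBy(_._1).foreach { case (cluster, n) =>
      println(s"cluster $cluster -> $n docs")
    }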


Re: Is MLBase dead?

2015-09-28 Thread Justin Pihony
To take a stab at my own answer: MLbase is now fully integrated into MLlib.
MLI/MLlib became the mllib algorithms, and the ML Optimizer (MLO) became the ml
pipelines?

On Mon, Sep 28, 2015 at 10:19 PM, Justin Pihony <justin.pih...@gmail.com>
wrote:

> As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and
> org.apache.spark.ml? I cannot find anything official, and the last updates
> seem to be a year or two old.
>
>
>
>


Is MLBase dead?

2015-09-28 Thread Justin Pihony
As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and
org.apache.spark.ml? I cannot find anything official, and the last updates
seem to be a year or two old.






Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
I figured it all out after this:
http://apache-spark-user-list.1001560.n3.nabble.com/WebUI-on-yarn-through-ssh-tunnel-affected-by-AmIpfilter-td21540.html

The short of it is that I needed to set

SPARK_PUBLIC_DNS (not DNS_HOME) = ec2_publicdns

Then the YARN proxy gets in the way, so I needed to go to:

http://ec2_publicdns:20888/proxy/applicationid/jobs  (9046 is the older
EMR proxy port)

Or, as Jonathan said, the Spark history server works once a job is
completed.

On Tue, Aug 25, 2015 at 5:26 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 OK, I figured out the horrid look also... the href of all of the styles is
 prefixed with the proxy data... so, ultimately, if I can fix the proxy
 issues with the links, then I can fix the look also

 On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to
 access the spark ui directly. The application proxy was still getting in
 the way by the way it creates the URL, so I manually filled in the
 /stage?id=#&attempt=# and that worked... I'm still having trouble with the
 css as the UI looks horrid... but I'll tackle that next :)

 On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 Thanks. I just tried and still am having trouble. It seems to still be
 using the private address even if I try going through the resource manager.

 On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

 I'm not sure why the UI appears broken like that either and haven't
 investigated it myself yet, but if you instead go to the YARN
 ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for
 3.x,
 I believe), then you should be able to click on the ApplicationMaster
 link
 (or the History link for completed applications) to get to the Spark UI
 from there. The ApplicationMaster link will use the YARN Proxy Service
 (port 20888 on emr-4.x; not sure about 3.x) to proxy through the Spark
 application's UI, regardless of what port it's running on. For completed
 applications, the History link will send you directly to the Spark
 History
 Server UI on port 18080. Hope that helps!

 ~ Jonathan




 On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

 I am using the steps from  this article
 https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
  to
 get spark up and running on EMR through yarn. Once up and running I
 ssh in
 and cd to the spark bin and run spark-shell --master yarn. Once this
 spins
 up I can see that the UI is started at the internal ip of 4040. If I
 hit
 the
 public dns at 4040 with dynamic port tunneling and foxyproxy then I
 get a
 crude UI (css seems broken), however the proxy continuously redirects
 me
 to
 the main page, so I cannot drill into anything. So, I tried static
 tunneling, but can't seem to get through.
 
 So, how can I access the spark UI when running a spark shell in AWS
 yarn?
 
 
 







Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
Thanks. I just tried and still am having trouble. It seems to still be
using the private address even if I try going through the resource manager.

On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com
wrote:

 I'm not sure why the UI appears broken like that either and haven't
 investigated it myself yet, but if you instead go to the YARN
 ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x,
 I believe), then you should be able to click on the ApplicationMaster link
 (or the History link for completed applications) to get to the Spark UI
 from there. The ApplicationMaster link will use the YARN Proxy Service
 (port 20888 on emr-4.x; not sure about 3.x) to proxy through the Spark
 application's UI, regardless of what port it's running on. For completed
 applications, the History link will send you directly to the Spark History
 Server UI on port 18080. Hope that helps!

 ~ Jonathan




 On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

 I am using the steps from  this article
 https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923   to
 get spark up and running on EMR through yarn. Once up and running I ssh in
 and cd to the spark bin and run spark-shell --master yarn. Once this spins
 up I can see that the UI is started at the internal ip of 4040. If I hit
 the
 public dns at 4040 with dynamic port tunneling and foxyproxy then I get a
 crude UI (css seems broken), however the proxy continuously redirects me
 to
 the main page, so I cannot drill into anything. So, I tried static
 tunneling, but can't seem to get through.
 
 So, how can I access the spark UI when running a spark shell in AWS yarn?
 
 
 




Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
OK, I figured out the horrid look also... the href of all of the styles is
prefixed with the proxy data... so, ultimately, if I can fix the proxy
issues with the links, then I can fix the look also

On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to
 access the spark ui directly. The application proxy was still getting in
 the way by the way it creates the URL, so I manually filled in the
 /stage?id=#&attempt=# and that worked... I'm still having trouble with the
 css as the UI looks horrid... but I'll tackle that next :)

 On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 Thanks. I just tried and still am having trouble. It seems to still be
 using the private address even if I try going through the resource manager.

 On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

 I'm not sure why the UI appears broken like that either and haven't
 investigated it myself yet, but if you instead go to the YARN
 ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for
 3.x,
 I believe), then you should be able to click on the ApplicationMaster
 link
 (or the History link for completed applications) to get to the Spark UI
 from there. The ApplicationMaster link will use the YARN Proxy Service
 (port 20888 on emr-4.x; not sure about 3.x) to proxy through the Spark
 application's UI, regardless of what port it's running on. For completed
 applications, the History link will send you directly to the Spark
 History
 Server UI on port 18080. Hope that helps!

 ~ Jonathan




 On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

 I am using the steps from  this article
 https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
  to
 get spark up and running on EMR through yarn. Once up and running I ssh
 in
 and cd to the spark bin and run spark-shell --master yarn. Once this
 spins
 up I can see that the UI is started at the internal ip of 4040. If I hit
 the
 public dns at 4040 with dynamic port tunneling and foxyproxy then I get
 a
 crude UI (css seems broken), however the proxy continuously redirects me
 to
 the main page, so I cannot drill into anything. So, I tried static
 tunneling, but can't seem to get through.
 
 So, how can I access the spark UI when running a spark shell in AWS
 yarn?
 
 
 






Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to
access the spark ui directly. The application proxy was still getting in
the way by the way it creates the URL, so I manually filled in the
/stage?id=#&attempt=# and that worked... I'm still having trouble with the
css as the UI looks horrid... but I'll tackle that next :)

On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 Thanks. I just tried and still am having trouble. It seems to still be
 using the private address even if I try going through the resource manager.

 On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

 I'm not sure why the UI appears broken like that either and haven't
 investigated it myself yet, but if you instead go to the YARN
 ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x,
 I believe), then you should be able to click on the ApplicationMaster link
 (or the History link for completed applications) to get to the Spark UI
 from there. The ApplicationMaster link will use the YARN Proxy Service
 (port 20888 on emr-4.x; not sure about 3.x) to proxy through the Spark
 application's UI, regardless of what port it's running on. For completed
 applications, the History link will send you directly to the Spark History
 Server UI on port 18080. Hope that helps!

 ~ Jonathan




 On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:

 I am using the steps from  this article
 https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
  to
 get spark up and running on EMR through yarn. Once up and running I ssh
 in
 and cd to the spark bin and run spark-shell --master yarn. Once this
 spins
 up I can see that the UI is started at the internal ip of 4040. If I hit
 the
 public dns at 4040 with dynamic port tunneling and foxyproxy then I get a
 crude UI (css seems broken), however the proxy continuously redirects me
 to
 the main page, so I cannot drill into anything. So, I tried static
 tunneling, but can't seem to get through.
 
 So, how can I access the spark UI when running a spark shell in AWS yarn?
 
 
 





Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
Additional info... If I use an online md5sum check then it matches... So it's
either Windows or Python (using 2.7.10).

On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com
wrote:

 When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
 seen this on two different machines. I am running on windows, but I would
 imagine that shouldn't affect the md5. Is this a boto problem, python
 problem, spark problem?







Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
I found this solution:
https://stackoverflow.com/questions/3390484/python-hashlib-md5-differs-between-linux-windows

Does anybody see a reason why I shouldn't put in a PR to make this change?

FROM:

with open(tgz_file_path) as tar:

TO:

with open(tgz_file_path, "rb") as tar:

On Mon, Aug 24, 2015 at 11:58 AM, Justin Pihony justin.pih...@gmail.com
wrote:

 Additional info...If I use an online md5sum check then it matches...So,
 it's either windows or python (using 2.7.10)

 On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com
 wrote:

 When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
 seen this on two different machines. I am running on windows, but I would
 imagine that shouldn't affect the md5. Is this a boto problem, python
 problem, spark problem?








Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
seen this on two different machines. I am running on Windows, but I would
imagine that shouldn't affect the md5. Is this a boto problem, a Python
problem, or a Spark problem?






How to access Spark UI through AWS

2015-08-24 Thread Justin Pihony
I am using the steps from this article
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to
get Spark up and running on EMR through YARN. Once up and running, I ssh in,
cd to the Spark bin, and run spark-shell --master yarn. Once this spins
up I can see that the UI is started at the internal IP on port 4040. If I hit
the public DNS on port 4040 with dynamic port tunneling and FoxyProxy then I
get a crude UI (the CSS seems broken), however the proxy continuously redirects
me to the main page, so I cannot drill into anything. So I tried static
tunneling, but can't seem to get through.

So, how can I access the Spark UI when running a spark shell in AWS YARN?






Re: Accumulators in Spark Streaming on UI

2015-05-26 Thread Justin Pihony
You need to make sure to name the accumulator.
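
For example (a sketch against the 1.x API, where stream stands in for the
asker's Kafka DStream): only accumulators created with a name show up on the
stage pages in the UI.

    // Unnamed: works, but never appears in the web UI.
    val silentCount = sc.accumulator(0L)

    // Named: shows up per stage in the UI.
    val eventCount = sc.accumulator(0L, "events received from Kafka")

    stream.foreachRDD { rdd =>
      rdd.foreach(_ => eventCount += 1L)
    }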

On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote nagmote.sne...@gmail.com
wrote:

 Hello all,

 I have  accumulator in spark streaming application which counts number of
 events received from Kafka.

 From the documentation , It seems Spark UI has support to display it .

 But I am unable to see it on UI. I am using spark 1.3.1

 Do I need to call any method (print)  or am I missing something ?


 Thanks in advance,

 Snehal





Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
The (crude) proof of concept seems to work:

class RDD[V](value: List[V]) {
  def doStuff = println("I'm doing stuff")
}

object RDD {
  implicit def toPair[V](x: RDD[V]) = new PairRDD(List((1, 2)))
}

class PairRDD[K, V](value: List[(K, V)]) extends RDD[(K, V)](value) {
  def doPairs = println("I'm using pairs")
}

class Context {
  def parallelize[K, V](x: List[(K, V)]) = new PairRDD(x)
  def parallelize[V](x: List[V]) = new RDD(x)
}
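
And a usage sketch showing the intended effect: the pair-only method becomes
visible (and tab-completable) on the plain RDD through the companion-object
implicit, mirroring what rddToPairRDDFunctions does today:

    val ctx = new Context
    val plain = ctx.parallelize(List(1, 2, 3))
    val pairs = ctx.parallelize(List(("a", 1), ("b", 2)))

    plain.doStuff  // defined on RDD
    pairs.doPairs  // defined directly on PairRDD
    plain.doPairs  // resolved via the implicit RDD.toPair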

On Fri, May 22, 2015 at 2:44 PM, Reynold Xin r...@databricks.com wrote:

 I'm not sure if it is possible to overload the map function twice, once
 for just KV pairs, and another for K and V separately.


 On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com
 wrote:

 This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved
 the RDD API, but it could be even more discoverable if made available via
 the API directly. I assume this was originally an omission that now needs
 to be kept for backwards compatibility, but would any of the repo owners be
 open to making this more discoverable to the point of API docs and tab
 completion (while keeping both binary and source compatibility)?


 class PairRDD extends RDD{
   pair methods
 }

 RDD{
   def map[K: ClassTag, V: ClassTag](f: T => (K,V)): PairRDD[K,V]
 }

 As long as the implicits remain, then compatibility remains, but now it
 is explicit in the docs on how to get a PairRDD and in tab completion.

 Thoughts?

 Justin Pihony





Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the
RDD API, but it could be even more discoverable if made available via the
API directly. I assume this was originally an omission that now needs to be
kept for backwards compatibility, but would any of the repo owners be open
to making this more discoverable to the point of API docs and tab
completion (while keeping both binary and source compatibility)?


class PairRDD extends RDD{
  pair methods
}

RDD{
  def map[K: ClassTag, V: ClassTag](f: T => (K,V)): PairRDD[K,V]
}

As long as the implicits remain, then compatibility remains, but now it is
explicit in the docs on how to get a PairRDD and in tab completion.

Thoughts?

Justin Pihony


Re: Spark logo license

2015-05-19 Thread Justin Pihony
Thanks!

On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:

 Check out Apache's trademark guidelines here:
 http://www.apache.org/foundation/marks/

 Matei

 On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com
 wrote:

 What is the license on using the spark logo. Is it free to be used for
 displaying commercially?









Spark logo license

2015-05-19 Thread Justin Pihony
What is the license on using the Spark logo? Is it free to be displayed
commercially?







Windows DOS bug in windows-utils.cmd

2015-05-19 Thread Justin Pihony
When running something like this:

spark-shell --jars foo.jar,bar.jar

This keeps failing to include the tail of the jars list. Digging into the
launch scripts I found that the comma makes it so that the list was sent as
separate parameters. So, to keep things together, I tried 

spark-shell --jars "foo.jar,bar.jar"

But, this still failed as the quotes carried over into some of the string
checks and resulted in invalid character errors. So, I am curious if anybody
sees a problem with making a PR to fix the script from

...
if "x%2"=="x" (
  echo "%1" requires an argument. >&2
  exit /b 1
)
set SUBMISSION_OPTS=%SUBMISSION_OPTS% %1 %2
...

TO

...
if "x%~2"=="x" (
  echo "%1" requires an argument. >&2
  exit /b 1
)
set SUBMISSION_OPTS=%SUBMISSION_OPTS% %1 %~2
...

The only difference is the use of the tilde to remove any surrounding quotes
if there are some. 

I figured I would ask here first to vet any unforeseen bugs this might cause
in other systems. As far as I know this should be harmless and only make it
so that comma separated lists will work in DOS.






TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I am trying to print a basic twitter stream and receiving the following
error:


15/05/18 22:03:14 INFO Executor: Fetching
http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with
timestamp 1432000973058
15/05/18 22:03:14 INFO Utils: Fetching
http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar to
C:\Users\Justin\AppData\Local\Temp\spark-4a37d3
e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp
15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.ja
va:715)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:443)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:374)
at
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:366)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:366)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:744)


Code is:

spark-shell --jars
\Spark\lib\spark-streaming-twitter_2.10-1.3.1.jar,\Spark\lib\twitter4j-async-3.0.3.jar,\Spark\lib\twitter4j-core-3.0.3.jar,\Spark\lib\twitter4j-media-support-3.0.3.jar,\Spark\lib\twitter4j-stream-3.0.3.jar

import org.apache.spark.streaming.twitter._
import org.apache.spark.streaming._

System.setProperty("twitter4j.oauth.consumerKey", "*")
System.setProperty("twitter4j.oauth.consumerSecret", "*")
System.setProperty("twitter4j.oauth.accessToken", "*")
System.setProperty("twitter4j.oauth.accessTokenSecret", "*")

val ssc = new StreamingContext(sc, Seconds(10))
val stream = TwitterUtils.createStream(ssc, None)
stream.print
ssc.start


This seems to be happening at FileUtil.chmod(targetFile.getAbsolutePath,
"a+x"), but I'm not sure why...






Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I think I found the answer -
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html

Do I have no way of running this in Windows locally?
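
(For what it's worth, a NullPointerException out of Hadoop's Shell.runCommand on
Windows is the classic symptom of winutils.exe not being found. A sketch of the
usual workaround, assuming the Hadoop Windows binaries are unpacked so that
C:\hadoop\bin\winutils.exe exists:)

    // Must take effect before Hadoop's Shell class is initialized, so in
    // practice set the HADOOP_HOME environment variable before launching
    // spark-shell, or:
    System.setProperty("hadoop.home.dir", "C:\\hadoop")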


On Mon, May 18, 2015 at 10:44 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 I'm not 100% sure that is causing a problem, though. The stream still
 starts, but is giving blank output. I checked the environment variables in
 the ui and it is running local[*], so there should be no bottleneck there.

 On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 I am trying to print a basic twitter stream and receiving the following
 error:


 15/05/18 22:03:14 INFO Executor: Fetching
 http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with
 timestamp 1432000973058
 15/05/18 22:03:14 INFO Utils: Fetching
 http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar to
 C:\Users\Justin\AppData\Local\Temp\spark-4a37d3

 e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp
 15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.NullPointerException
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.ja
 va:715)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:443)
 at

 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:374)
 at

 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:366)
 at

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at

 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at

 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at
 org.apache.spark.executor.Executor.org
 $apache$spark$executor$Executor$$updateDependencies(Executor.scala:366)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:744)


 Code is:

 spark-shell --jars

 \Spark\lib\spark-streaming-twitter_2.10-1.3.1.jar,\Spark\lib\twitter4j-async-3.0.3.jar,\Spark\lib\twitter4j-core-3.0.3.jar,\Spark\lib\twitter4j-media-support-3.0.3.jar,\Spark\lib\twitter4j-stream-3.0.3.jar

 import org.apache.spark.streaming.twitter._
 import org.apache.spark.streaming._

 System.setProperty(twitter4j.oauth.consumerKey,*)
 System.setProperty(twitter4j.oauth.consumerSecret,*)
 System.setProperty(twitter4j.oauth.accessToken,*)
 System.setProperty(twitter4j.oauth.accessTokenSecret,*)

 val ssc = new StreamingContext(sc, Seconds(10))
 val stream = TwitterUtils.createStream(ssc, None)
 stream.print
 ssc.start


 This seems to be happening at FileUtil.chmod(targetFile.getAbsolutePath,
 a+x) but Im not sure why...








Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I'm not 100% sure that is causing a problem, though. The stream still
starts, but is giving blank output. I checked the environment variables in
the ui and it is running local[*], so there should be no bottleneck there.

On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com
wrote:

 I am trying to print a basic twitter stream and receiving the following
 error:


 15/05/18 22:03:14 INFO Executor: Fetching
 http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar with
 timestamp 1432000973058
 15/05/18 22:03:14 INFO Utils: Fetching
 http://192.168.56.1:49752/jars/twitter4j-media-support-3.0.3.jar to
 C:\Users\Justin\AppData\Local\Temp\spark-4a37d3

 e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp
 15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.NullPointerException
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:482)
 at org.apache.hadoop.util.Shell.run(Shell.java:455)
 at
 org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.ja
 va:715)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:873)
 at org.apache.hadoop.fs.FileUtil.chmod(FileUtil.java:853)
 at org.apache.spark.util.Utils$.fetchFile(Utils.scala:443)
 at

 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:374)
 at

 org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:366)
 at

 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
 at
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
 at

 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at
 org.apache.spark.executor.Executor.org
 $apache$spark$executor$Executor$$updateDependencies(Executor.scala:366)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:184)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:744)


 Code is:

 spark-shell --jars

 \Spark\lib\spark-streaming-twitter_2.10-1.3.1.jar,\Spark\lib\twitter4j-async-3.0.3.jar,\Spark\lib\twitter4j-core-3.0.3.jar,\Spark\lib\twitter4j-media-support-3.0.3.jar,\Spark\lib\twitter4j-stream-3.0.3.jar

 import org.apache.spark.streaming.twitter._
 import org.apache.spark.streaming._

 System.setProperty(twitter4j.oauth.consumerKey,*)
 System.setProperty(twitter4j.oauth.consumerSecret,*)
 System.setProperty(twitter4j.oauth.accessToken,*)
 System.setProperty(twitter4j.oauth.accessTokenSecret,*)

 val ssc = new StreamingContext(sc, Seconds(10))
 val stream = TwitterUtils.createStream(ssc, None)
 stream.print
 ssc.start


 This seems to be happening at FileUtil.chmod(targetFile.getAbsolutePath,
 a+x) but Im not sure why...







Trying to understand sc.textFile better

2015-05-17 Thread Justin Pihony
All,
I am trying to understand the textFile method deeply, but I think my
lack of deep Hadoop knowledge is holding me back here. Let me lay out my
understanding and maybe you can correct anything that is incorrect

When sc.textFile(path) is called, defaultMinPartitions is used, which is really
just math.min(taskScheduler.defaultParallelism, 2). Let's assume we are using
the SparkDeploySchedulerBackend, where that parallelism is
conf.getInt("spark.default.parallelism", math.max(totalCoreCount.get(), 2)).
So, now let's say the default is 2. Going back to textFile, this is passed in to
HadoopRDD. The true size is determined in getPartitions() using
inputFormat.getSplits(jobConf, minPartitions). But, from what I can find, the
partition count is merely a hint and is in fact mostly ignored, so you will
probably just get the total number of blocks.
OK, this fits with expectations. However, what if the default is not used and
you provide a partition size that is larger than the block size? If my research
is right and the getSplits call simply ignores this parameter, then wouldn't the
provided minimum end up being ignored, and you would still just end up with
block-sized splits?
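
A quick empirical check of the interplay, for whoever can confirm (a sketch; the
path is a placeholder, and the counts should be compared against the file's
actual number of HDFS blocks):

    // Default minPartitions = min(defaultParallelism, 2)
    val byDefault = sc.textFile("hdfs:///data/big.log")
    println(s"default          -> ${byDefault.partitions.length} partitions")

    // Explicit hint passed through to inputFormat.getSplits
    val withHint = sc.textFile("hdfs:///data/big.log", minPartitions = 64)
    println(s"minPartitions=64 -> ${withHint.partitions.length} partitions")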

Thanks,
Justin






Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
I started to play with 1.3.0 and found that there are a lot of breaking
changes. Previously, I could do the following:

case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
import sqlContext._
rdd.registerTempTable("foo")

Now, I am not able to directly use my RDD object and have it implicitly
become a DataFrame. It can be used as a DataFrameHolder, of which I could
write:

rdd.toDF.registerTempTable("foo")

But, that is kind of a pain in comparison. The other problem for me is that
I keep getting a SQLException:

java.sql.SQLException: Failed to start database 'metastore_db' with
class loader  sun.misc.Launcher$AppClassLoader@10393e97, see the next
exception for details.

This seems to be a dependency on Hive, where previously (1.2.0) there was no
such dependency. I can open tickets for these, but wanted to ask here
first... maybe I am doing something wrong?
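
For reference, a minimal sketch of what the 1.3 API appears to want (the extra
import is the piece that gives the RDD-to-DataFrameHolder conversion):

    import sqlContext.implicits._

    case class Foo(x: Int)
    val rdd = sc.parallelize(List(Foo(1)))

    // 1.3: the implicit yields a DataFrameHolder, so toDF() is the extra hop.
    rdd.toDF().registerTempTable("foo")
    sqlContext.sql("SELECT x FROM foo").show()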

Thanks,
Justin






Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
It appears that the metastore_db problem is related to
https://issues.apache.org/jira/browse/SPARK-4758. I had another shell open
that was stuck. This is probably a bug, though?

import sqlContext.implicits._
case class Foo(x: Int)
val rdd = sc.parallelize(List(Foo(1)))
rdd.toDF

results in a frozen shell after this line:

INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on
mysql: Lexical error at line 1, column 5.  Encountered: @ (64), after :
.

which locks the internally created metastore_db.


On Wed, Mar 18, 2015 at 11:20 AM, Justin Pihony justin.pih...@gmail.com
wrote:

 I started to play with 1.3.0 and found that there are a lot of breaking
 changes. Previously, I could do the following:

 case class Foo(x: Int)
 val rdd = sc.parallelize(List(Foo(1)))
 import sqlContext._
 rdd.registerTempTable(foo)

 Now, I am not able to directly use my RDD object and have it implicitly
 become a DataFrame. It can be used as a DataFrameHolder, of which I could
 write:

 rdd.toDF.registerTempTable(foo)

 But, that is kind of a pain in comparison. The other problem for me is that
 I keep getting a SQLException:

 java.sql.SQLException: Failed to start database 'metastore_db' with
 class loader  sun.misc.Launcher$AppClassLoader@10393e97, see the next
 exception for details.

 This seems to be a dependency on Hive, when previously (1.2.0) there was no
 such dependency. I can open tickets for these, but wanted to ask here
 first... maybe I am doing something wrong?

 Thanks,
 Justin







Bug in Streaming files?

2015-03-14 Thread Justin Pihony
All,
Looking into  this StackOverflow question
https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469  
it appears that there is a bug when utilizing the newFilesOnly parameter in
FileInputDStream. Before creating a ticket, I wanted to verify it here. The
gist is that this code is wrong:

val modTimeIgnoreThreshold = math.max(
    initialModTimeIgnoreThreshold,                 // initial threshold based on newFilesOnly setting
    currentTime - durationToRemember.milliseconds  // trailing end of the remember window
  )

The problem is that if you set newFilesOnly to false, then
initialModTimeIgnoreThreshold is always 0, which means it always loses the max
operation. So the best you get is files that were put in the directory within
the remember duration of the start.

Is this a bug or expected behavior; it seems like a bug to me.

If I am correct, this appears to be a bigger fix than just using min as it
would break other functionality.






SparkSQL JSON array support

2015-03-05 Thread Justin Pihony
Is there any plans of supporting JSON arrays more fully? Take for example:

val myJson =
  sqlContext.jsonRDD(sc.parallelize(List("""{"foo":[{"bar":1},{"baz":2}]}""")))
myJson.registerTempTable("JsonTest")

I would like a way to pull out parts of the array data based on a key

sql("SELECT foo['bar'] FROM JsonTest") // projects only the object with "bar";
                                       // the rest would be null
 
I could even work around this if there was some way to access the key name
from the SchemaRDD:

myJson.filter(x => x(0).asInstanceOf[Seq[Row]].exists(y => y.key == "bar"))
      .map(x => x(0).asInstanceOf[Seq[Row]].filter(y => y.key == "bar"))
// This does the same as above, except also filtering out rows without a
// "bar" key

This is the closest suggestion I could find thus far,
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+LateralView 
which still does not solve the problem of pulling out the keys.
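
(For completeness, the LateralView route from that link comes out roughly like
the sketch below when run through a HiveContext. It flattens the array, but as
noted it still keys off a field name you already have to know, rather than
letting you discover the keys.)

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    val json = hiveContext.jsonRDD(
      sc.parallelize(List("""{"foo":[{"bar":1},{"baz":2}]}""")))
    json.registerTempTable("JsonTest")

    // One row per array element; elements without a "bar" come back as null.
    hiveContext.sql(
      "SELECT item.bar FROM JsonTest LATERAL VIEW explode(foo) t AS item")
      .collect().foreach(println)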

I tried with a UDF also, but could not currently make that work either.

If there isn't anything in the works, then would it be appropriate to create
a ticket for this?

Thanks,
Justin






Re: Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
Thanks!

On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust mich...@databricks.com
wrote:

 It is somewhat out of data, but here is what we have so far:
 https://github.com/marmbrus/sql-typed

 On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 I am pretty sure that I saw a presentation where SparkSQL could be
 executed
 with static analysis, however I cannot find the presentation now, nor can
 I
 find any documentation or research papers on the topic. So, I am curious
 if
 there is indeed any work going on for this topic. The two things I would
 be
 interested in would be to be able to gain compile time safety, as well as
 gain the ability to work on my data as a type instead of a row (ie,
 result.map(x=x.Age) instead of having to use Row.get)










Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
I am pretty sure that I saw a presentation where SparkSQL could be executed
with static analysis, however I cannot find the presentation now, nor can I
find any documentation or research papers on the topic. So, I am curious if
there is indeed any work going on for this topic. The two things I would be
interested in are gaining compile-time safety, and gaining the ability to work
on my data as a type instead of a row (i.e., result.map(x => x.Age) instead of
having to use Row.get).








SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
Per the documentation:

  "It is important to make sure that the structure of every Row of the
provided RDD matches the provided schema. Otherwise, there will be runtime
exception."

However, it appears that this is not being enforced. 

import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
val struct = StructType(List(StructField("test", BooleanType, true)))
val myData = sc.parallelize(List(Row(0), Row(true), Row("stuff")))
val schemaData = sqlContext.applySchema(myData, struct) // No error
schemaData.collect()(0).getBoolean(0) // Only now will I receive an error

Is this expected or a bug?
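
(If eager checking is wanted, the closest workaround I can think of is forcing a
validation pass yourself before trusting the schema; a rough sketch:)

    // Touch every row once and verify the declared Boolean column up front,
    // so a bad row fails fast instead of at the first getBoolean call.
    myData.foreach { row =>
      require(row(0) == null || row(0).isInstanceOf[Boolean],
        s"row does not match schema: $row")
    }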

Thanks,
Justin






Re: SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
OK, but what about on an action, like collect()? Shouldn't it be able to
determine the correctness at that time?

On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai yh...@databricks.com wrote:

 Hi Justin,

 It is expected. We do not check if the provided schema matches rows since
 all rows need to be scanned to give a correct answer.

 Thanks,

 Yin

 On Fri, Feb 13, 2015 at 1:33 PM, Justin Pihony justin.pih...@gmail.com
 wrote:

 Per the documentation:

   It is important to make sure that the structure of every Row of the
 provided RDD matches the provided schema. Otherwise, there will be runtime
 exception.

 However, it appears that this is not being enforced.

 import org.apache.spark.sql._
 val sqlContext = new SqlContext(sc)
 val struct = StructType(List(StructField(test, BooleanType, true)))
 val myData = sc.parallelize(List(Row(0), Row(true), Row(stuff)))
 val schemaData = sqlContext.applySchema(myData, struct) //No error
 schemaData.collect()(0).getBoolean(0) //Only now will I receive an error

 Is this expected or a bug?

 Thanks,
 Justin


