Re: discontinuous offset in Kafka will cause Spark Streaming failure

2018-01-23 Thread Justin Miller
scale stream processor tomorrow. CCing: Cody Koeninger Best, Justin > On Jan 23, 2018, at 10:48 PM, namesuperwood <namesuperw...@gmail.com> wrote: > > Hi all > > kafka version : kafka_2.11-0.11.0.2 > spark version : 2.0.1 > > A topic-pa

Re: "Got wrong record after seeking to offset" issue

2018-01-18 Thread Justin Miller
org/jira/browse/SPARK-17147 We are compacting topics, but only offset topics. We just updated our message version to 0.10 today as our last non-Spark project was brought up to 0.11 (Storm based). Justin > On Jan 18, 2018, at 1:39 PM, Cody Koeninger <c...@koeninger.org> wr

Re: "Got wrong record after seeking to offset" issue

2018-01-17 Thread Justin Miller
! Justin On Wednesday, January 17, 2018, Cody Koeninger <c...@koeninger.org> wrote: > That means the consumer on the executor tried to seek to the specified > offset, but the message that was returned did not have a matching > offset. If the executor can't get the messages

"Got wrong record after seeking to offset" issue

2018-01-16 Thread Justin Miller
ter that’s crashing). Thank you! Justin

Forcing either Hive or Spark SQL representation for metastore

2017-05-18 Thread Justin Miller
etting the following for all: Caused by: MetaException(message:java.io.IOException: Got exception: java.io.IOException /username/sys_encrypted/staging/raw/updatedTimeYear=2017/updatedTimeMonth=5/updatedTimeDay=16/updatedTimeHour=23 doesn't exist) Thanks! Justin

Is there a way to tell if a receiver is a Reliable Receiver?

2017-04-17 Thread Justin Pihony
I can't seem to find anywhere that would let a user know if the receiver they are using is reliable or not. Even better would be a list of known reliable receivers. Are any of these things possible? Or do you just have to research your receiver beforehand?

Avro/Parquet GenericFixed decimal is not read into Spark correctly

2017-04-12 Thread Justin Pihony
All, Before creating a JIRA for this I wanted to get a sense as to whether it would be shot down or not: Take the following code: spark-shell --packages org.apache.avro:avro:1.8.1 import org.apache.avro.{Conversions, LogicalTypes, Schema} import java.math.BigDecimal val dc = new

Spark Streaming Kafka Job has strange behavior for certain tasks

2017-04-05 Thread Justin Miller
compute.internal (executor 53) (96/96) Thanks, Justin

SparkStreaming getActiveOrCreate

2017-03-18 Thread Justin Pihony
nything explicit -Justin Pihony

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Justin Miller
I've created a ticket here: https://issues.apache.org/jira/browse/SPARK-19888 Thanks, Justin > On Mar 10, 2017, at 1:14 PM, Michael Armbrust <mich...@databricks.com> wrote: > > If you have a reproduction you should o

Re: How to gracefully handle Kafka OffsetOutOfRangeException

2017-03-10 Thread Justin Miller
Hi Michael, I'm experiencing a similar issue. Will this not be fixed in Spark Streaming? Best, Justin > On Mar 10, 2017, at 8:34 AM, Michael Armbrust <mich...@databricks.com> wrote: > > One option here would be to try Structured Streaming. We've added an option

Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
I've verified this is that issue, so please disregard. On Wed, Mar 1, 2017 at 1:07 AM, Justin Pihony <justin.pih...@gmail.com> wrote: > As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at > it deeper

Re: Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
As soon as I posted this I found https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at it deeper now. On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony <justin.pih...@gmail.com> wrote: > Run spark-shell --packages > datastax:spark-cassandra-connect

Jar not in shell classpath in Windows 10

2017-02-28 Thread Justin Pihony
Run spark-shell --packages datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to import anything under com.datastax. I have checked that the jar is listed among the classpaths and it is, albeit behind a Spark URL. I'm wondering if added jars fail on Windows due to this server

Is there a list of missing optimizations for typed functions?

2017-02-22 Thread Justin Pihony
't find anything in JIRA. Thanks, Justin Pihony

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Justin Miller
> attached to all of the code that is currently intended for the associated > release number. > > On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <justin.mil...@protectwise.com> wrote: > It looks like the jars for 2.1.0-SNAP

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-28 Thread Justin Miller
ec 28 20:01:10 UTC 2016 2.2.0-SNAPSHOT/ Wed Dec 28 19:12:38 UTC 2016 What's with 2.1.1-SNAPSHOT? Is that version about to be released as well? Thanks! Justin >

Re: Is there any scheduled release date for Spark 2.1.0?

2016-12-23 Thread Justin Miller
I'm curious about this as well. Seems like the vote passed. > On Dec 23, 2016, at 2:00 AM, Aseem Bansal wrote: >

K-Means seems biased to one center

2015-10-05 Thread Justin Pihony
(Cross post with http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center) I have a corpus of wiki pages (baseball, hockey, music, football) which I'm running through tfidf and then through kmeans. After a couple issues to start (you can see my previous questions),

save checkpoint during dataframe row iteration

2015-10-05 Thread Justin Permar
Good morning, I have a typical iterator loop on a DataFrame loaded from a parquet data source: val conf = new SparkConf().setAppName("Simple Application").setMaster("local") val sc = new JavaSparkContext(conf) val sqlContext = new org.apache.spark.sql.SQLContext(sc) val parquetDataFrame =

Re: Is MLBase dead?

2015-09-28 Thread Justin Pihony
To take a stab at my own answer: MLBase is now fully integrated into MLLib. MLI/MLLib are the mllib algorithms and MLO is the ml pipelines? On Mon, Sep 28, 2015 at 10:19 PM, Justin Pihony <justin.pih...@gmail.com> wrote: > As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.sp

Is MLBase dead?

2015-09-28 Thread Justin Pihony
As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and org.apache.spark.ml? I cannot find anything official, and the last updates seem to be a year or two old.

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
to: http://ec2_publicdns:20888/proxy/applicationid/jobs (9046 is the older emr port) or, as Jonathan said, the spark history server works once a job is completed. On Tue, Aug 25, 2015 at 5:26 PM, Justin Pihony justin.pih...@gmail.com wrote: OK, I figured the horrid look alsothe href

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
18080. Hope that helps! ~ Jonathan On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote: I am using the steps from this article https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to get spark up and running on EMR through yarn. Once up and running I ssh

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
OK, I figured the horrid look also... the href of all of the styles is prefixed with the proxy data... so, ultimately if I can fix the proxy issues with the links, then I can fix the look also On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony justin.pih...@gmail.com wrote: SUCCESS! I set

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
as the UI looks horrid... but I'll tackle that next :) On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony justin.pih...@gmail.com wrote: Thanks. I just tried and still am having trouble. It seems to still be using the private address even if I try going through the resource manager. On Tue, Aug 25, 2015

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
Additional info...If I use an online md5sum check then it matches...So, it's either windows or python (using 2.7.10) On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com wrote: When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two

Re: Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
at 11:58 AM, Justin Pihony justin.pih...@gmail.com wrote: Additional info...If I use an online md5sum check then it matches...So, it's either windows or python (using 2.7.10) On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com wrote: When running the spark_ec2.py script, I'm

Got wrong md5sum for boto

2015-08-24 Thread Justin Pihony
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now seen this on two different machines. I am running on windows, but I would imagine that shouldn't affect the md5. Is this a boto problem, python problem, spark problem?

How to access Spark UI through AWS

2015-08-24 Thread Justin Pihony
I am using the steps from this article https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to get spark up and running on EMR through yarn. Once up and running I ssh in and cd to the spark bin and run spark-shell --master yarn. Once this spins up I can see that the UI is started

Windowed stream operations -- These are too lazy for some use cases

2015-08-20 Thread Justin Grimes
We are aggregating real time logs of events, and want to do windows of 30 minutes. However, since the computation doesn't start until 30 minutes have passed, there is a ton of data built up that processing could've already started on. When it comes time to actually process the data, there is too
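A minimal sketch of the pattern under discussion, assuming the Spark 1.x DStream API and a hypothetical socket source; shortening the slide interval makes partial results appear long before a full 30 minutes has elapsed:
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}
    val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Minutes(1)) // 1-minute batches
    val events = ssc.socketTextStream("localhost", 9999) // hypothetical source
    // a 30-minute window recomputed every minute, instead of once per 30 minutes
    events.window(Minutes(30), Minutes(1)).count().print()
    ssc.start()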

Re: Windowed stream operations -- These are too lazy for some use cases

2015-08-20 Thread Justin Grimes
of processing overhead -- I couldn't figure out exactly why but it seemed to have something to do with forEachRDD only being executed on the driver. On Thu, Aug 20, 2015 at 1:39 PM, Iulian Dragoș iulian.dra...@typesafe.com wrote: On Thu, Aug 20, 2015 at 6:58 PM, Justin Grimes jgri...@adzerk.com wrote: We

Spark Python process

2015-06-24 Thread Justin Steigel
I have a spark job that's running on a 10 node cluster and the python process on all the nodes is pegged at 100%. I was wondering what parts of a spark script are run in the python process and which get passed to the Java processes? Is there any documentation on this? Thanks, Justin

NaiveBayes for MLPipeline is absent

2015-06-18 Thread Justin Yip
Hello, Currently, there is no NaiveBayes implementation for MLpipeline. I couldn't find a related JIRA ticket either (or maybe I missed it). Is there a plan to implement it? If no one has the bandwidth, I can work on it. Thanks. Justin

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Justin Yip
Done. https://issues.apache.org/jira/browse/SPARK-8420 Justin On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote: That sounds like a bug. Could you create a JIRA and ping Yin Huai (cc'ed). -Xiangrui On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
$lzycompute(random.scala:39) at org.apache.spark.sql.catalyst.expressions.RDG.rng(random.scala:39) .. Does anyone know why? Thanks. Justin

Why the default Params.copy doesn't work for Model.copy?

2015-06-04 Thread Justin Yip
feature generation transformer like StringIndexerModel cannot be used in Pipeline. Maybe due to my limited knowledge in ML pipeline, can anyone give me some hints why Model.copy behaves differently from other Params? Thanks! Justin

Setting S3 output file grantees for spark output files

2015-06-04 Thread Justin Steigel
Hi all, I'm running Spark on AWS EMR and I'm having some issues getting the correct permissions on the output files using rdd.saveAsTextFile('file_dir_name'). In hive, I would add a line in the beginning of the script with set fs.s3.canned.acl=BucketOwnerFullControl and that would set the
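A hedged sketch of the Spark-side equivalent: the Hive set line maps to a Hadoop configuration property that can be set on the SparkContext before writing. Whether EMRFS honors this exact key depends on the EMR/Hadoop version, so the property name here is an assumption carried over from the Hive setting; rdd is the dataset from the question and the bucket is hypothetical.
    // property name mirrors the Hive setting; verify against your EMR release
    sc.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
    rdd.saveAsTextFile("s3://my-bucket/file_dir_name") // bucket name hypothetical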

Python Image Library and Spark

2015-06-03 Thread Justin Spargur
Hi all, I'm playing around with manipulating images via Python and want to utilize Spark for scalability. That said, I'm just learning Spark and my Python is a bit rusty (been doing PHP coding for the last few years). I think I have most of the process figured out. However, the script fails

Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-05-27 Thread Justin Yip
: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp] scala> df.filter($"_2" <= "2014-06-01").show +--+--+ |_1|_2| +--+--+ +--+--+ Not sure if that is intended, but I cannot find any doc mentioning these inconsistencies. Thanks. Justin

Re: Accumulators in Spark Streaming on UI

2015-05-26 Thread Justin Pihony
You need to make sure to name the accumulator. On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote nagmote.sne...@gmail.com wrote: Hello all, I have accumulator in spark streaming application which counts number of events received from Kafka. From the documentation , It seems Spark UI has
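A minimal sketch of the suggestion, using the Spark 1.x accumulator API; only the overload that takes a name shows up on the stage page:
    // unnamed accumulators do not appear in the web UI
    val unnamed = sc.accumulator(0L)
    // named accumulators are shown on the stage detail page
    val eventCount = sc.accumulator(0L, "eventsFromKafka")
    sc.parallelize(1 to 100).foreach(_ => eventCount += 1L)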

Building scaladoc using build/sbt unidoc failure

2015-05-26 Thread Justin Yip
/docs and used the following command: $ build/sbt unidoc Please see attachment for detailed error. Did I miss anything? Thanks. Justin unidoc_error.txt (30K) http://apache-spark-user-list.1001560.n3.nabble.com/attachment/23044/0/unidoc_error.txt

Re: Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
for K and V separately. On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com wrote: This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved the RDD API, but it could be even more discoverable if made available via the API directly. I assume

Why is RDD to PairRDDFunctions only via implicits?

2015-05-22 Thread Justin Pihony
as the implicits remain, then compatibility remains, but now it is explicit in the docs on how to get a PairRDD and in tab completion. Thoughts? Justin Pihony
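For context, a small sketch of the conversion in question: reduceByKey and friends live on PairRDDFunctions and are reached through the rddToPairRDDFunctions implicit (importable from SparkContext._ before Spark 1.3, on the RDD companion object afterwards), which is why they are invisible on RDD itself in the docs and in tab completion.
    import org.apache.spark.SparkContext._ // needed before Spark 1.3
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
    // reduceByKey is defined on PairRDDFunctions, not on RDD
    val counts = pairs.reduceByKey(_ + _)
    counts.collect() // Array((a,4), (b,2))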

Re: User Defined Type (UDT)

2015-05-20 Thread Justin Uang
Xiangrui, is there a timeline for when UDTs will become a public API? I'm currently using them to support Java 8's ZonedDateTime. On Tue, May 19, 2015 at 3:14 PM Xiangrui Meng men...@gmail.com wrote: (Note that UDT is not a public API yet.) On Thu, May 7, 2015 at 7:11 AM, wjur

Re: Spark logo license

2015-05-19 Thread Justin Pihony
Thanks! On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com wrote: What is the license on using

Spark logo license

2015-05-19 Thread Justin Pihony
What is the license on using the Spark logo? Is it free to be used for displaying commercially?

Windows DOS bug in windows-utils.cmd

2015-05-19 Thread Justin Pihony
When running something like this: spark-shell --jars foo.jar,bar.jar This keeps failing to include the tail of the jars list. Digging into the launch scripts I found that the comma makes it so that the list was sent as separate parameters. So, to keep things together, I tried

TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
-media-support-3.0.3.jar to C:\Users\Justin\AppData\Local\Temp\spark-4a37d3 e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp 15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.NullPointerException

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I think I found the answer - http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html Do I have no way of running this in Windows locally? On Mon, May 18, 2015 at 10:44 PM, Justin Pihony justin.pih...@gmail.com wrote

Re: TwitterUtils on Windows

2015-05-18 Thread Justin Pihony
I'm not 100% sure that is causing a problem, though. The stream still starts, but is giving blank output. I checked the environment variables in the UI and it is running local[*], so there should be no bottleneck there. On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com wrote

Trying to understand sc.textFile better

2015-05-17 Thread Justin Pihony
if the default is not used and you provide a partition size that is larger than the block size. If my research is right and the getSplits call simply ignores this parameter, then wouldn't the provided min end up being ignored and you would still just get the block size? Thanks, Justin
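For reference, a sketch of the parameter being discussed; whether the resulting partition count honors the hint or rounds to the block size is exactly the open question, so the count below should not be assumed. The path is hypothetical.
    // ask for at least 100 partitions; the underlying getSplits call
    // may still split only on block boundaries
    val lines = sc.textFile("hdfs:///data/large.log", minPartitions = 100)
    println(lines.partitions.length)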

Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Justin Yip
Hello, I am using MLPipeline. I would like to extract the best parameter found by CrossValidator. But I cannot find much documentation about how to do it. Can anyone give me some pointers? Thanks. Justin
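A hedged sketch of one way to get at it, assuming the spark.ml CrossValidator of that era; the exact accessors vary across 1.x releases, so treat this as a starting point rather than the canonical answer. cv and trainingData are assumed set up elsewhere.
    val cvModel = cv.fit(trainingData)
    val best = cvModel.bestModel
    // the params the winning model was fit with live on its parent estimator;
    // extractParamMap is only available in later 1.x releases
    println(best.parent.extractParamMap())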

Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
: int, _2: int] scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema Thanks. Justin
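Spelled out, the rename-before-join workaround in the snippet above looks roughly like this (column and key names taken from the thread; the sample data is invented):
    val df = sqlContext.createDataFrame(Seq((1, 10), (2, 20))).toDF("_1", "_2")
    val df2 = sqlContext.createDataFrame(Seq((1, 100), (3, 300))).toDF("_1", "_2")
    // rename the right-hand key so the join condition is unambiguous
    val right = df2.withColumnRenamed("_1", "right_key")
    df.join(right, df("_1") === right("right_key")).printSchema()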

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread Justin Yip
time of each user, and join it back to the original data frame. But that involves two shuffles. Hence would like to see if there are ways to improve the performance. Thanks. Justin On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote: can you kindly elaborate on this? it should

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
) at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161) ... Thanks! Justin On Fri, May 15, 2015 at 3:55 PM, Michael Armbrust mich...@databricks.com wrote: There are several ways to solve this ambiguity: *1. use the DataFrames to get the attribute so its already resolved and not just a string we need

Custom Aggregate Function for DataFrame

2015-05-14 Thread Justin Yip
Hello, May I know if there is a way to implement an aggregate function for grouped data in DataFrame? I dug into the doc but didn't find any apart from the UDF functions which apply on a Row. Maybe I have missed something. Thanks. Justin

Creating StructType with DataFrame.withColumn

2015-04-30 Thread Justin Yip
missed something crucial. Can anyone give me some pointers? Thanks. Justin

Re: casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using DataType solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip yipjus...@prediction.io wrote: Hello, I was able to cast a timestamp into long using
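Expanded into a self-contained sketch under 1.3-era APIs (casting a timestamp to LongType yields epoch seconds, hence the multiplication for milliseconds); the sample row is invented:
    import org.apache.spark.sql.types.LongType
    import sqlContext.implicits._
    val df = Seq((1, java.sql.Timestamp.valueOf("2015-04-30 12:00:00"))).toDF("id", "eventTime")
    // timestamp -> long gives seconds since the epoch; scale up for millis
    df.withColumn("millis", $"eventTime".cast(LongType) * 1000).show()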

casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
to this failure? Thanks. Justin

Re: Powered By Spark

2015-04-27 Thread Justin
the relevance of programmatic marketing Thanks! Justin Justin Barton CTO +1 (718) 404 9272 +44 203 290 9272 atp.io | jus...@atp.io | find us https://atp.io/find-us

Column renaming after DataFrame.groupBy

2015-04-21 Thread Justin Yip
(nullable = false) |-- SUM(_1#179): long (nullable = true) |-- SUM(_2#180): long (nullable = true) Thanks. Justin

Re: Expected behavior for DataFrame.unionAll

2015-04-14 Thread Justin Yip
That explains it. Thanks Reynold. Justin On Mon, Apr 13, 2015 at 11:26 PM, Reynold Xin r...@databricks.com wrote: I think what happened was applying the narrowest possible type. Type widening is required, and as a result, the narrowest type is string between a string and an int. https
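A small sketch of the widening Reynold describes, assuming a 1.3-era sqlContext: unioning an int column with a string column yields a string column.
    import sqlContext.implicits._
    val ints = Seq(Tuple1(1), Tuple1(2)).toDF("c") // c: int
    val strs = Seq(Tuple1("x")).toDF("c")          // c: string
    // the narrowest type covering both int and string is string
    ints.unionAll(strs).printSchema()              // c: string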

Catching executor exception from executor in driver

2015-04-14 Thread Justin Yip
$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(console:51) ... CAUSE: null The exception cause is a null value. Is there any way that I can catch the ArithmeticException? Thanks Justin

Fwd: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
: string (nullable = true) The unionAll documentation says it behaves like the SQL UNION ALL function. However, unioning incompatible types is not well defined for SQL. Is there any expected behavior for unioning incompatible data frames? Thanks. Justin

Re: How to use Joda Time with Spark SQL?

2015-04-12 Thread Justin Yip
, but if they are available as UDT and provided by the SparkSQL library, that will make DataFrame users' life easier. Justin On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote: One possible approach can be defining a UDT (user-defined type) for Joda time. A UDT maps an arbitrary

The $ notation for DataFrame Column

2015-04-10 Thread Justin Yip
Hello, The DataFrame documentation always uses $"columnX" to annotate a column. But I cannot find much information about it. Maybe I have missed something. Can anyone point me to the doc about the $, if there is any? Thanks. Justin
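For reference, the $ syntax is a string interpolator that turns a name into a Column, provided by the StringToColumn implicit inside sqlContext.implicits; a sketch with invented data:
    import sqlContext.implicits._ // brings the StringToColumn implicit into scope
    val df = sqlContext.createDataFrame(Seq(Tuple1(1), Tuple1(5))).toDF("columnX")
    df.select($"columnX")     // $"name" builds a Column, same as df.col("columnX")
    df.filter($"columnX" > 1) // Column operators then compose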

DataFrame column name restriction

2015-04-10 Thread Justin Yip
: cannot resolve 'b.b' given input columns a, b.b; at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) Thanks. Justin

Re: DataFrame groupBy MapType

2015-04-07 Thread Justin Yip
Thanks Michael. Will submit a ticket. Justin On Mon, Apr 6, 2015 at 1:53 PM, Michael Armbrust mich...@databricks.com wrote: I'll add that I don't think there is a convenient way to do this in the Column API ATM, but would welcome a JIRA for adding it :) On Mon, Apr 6, 2015 at 1:45 PM

Expected behavior for DataFrame.unionAll

2015-04-07 Thread Justin Yip
: string (nullable = true) The unionAll documentation says it behaves like the SQL UNION ALL function. However, unioning incompatible types is not well defined for SQL. Is there any expected behavior for unioning incompatible data frames? Thanks. Justin

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
is that DataFrame cannot leverage its columnar format after persisting in memory. But I cannot find this mentioned anywhere in the docs. Did I miss anything? Thanks! Justin

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io wrote: Hello

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Thanks for the explanation Yin. Justin On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai yh...@databricks.com wrote: I think the slowness is caused by the way that we serialize/deserialize the value of a complex type. I have opened https://issues.apache.org/jira/browse/SPARK-6759 to track

DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
such operation? Can't find much info about operating MapType on Column in the doc. Thanks ahead! Justin

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
. Justin On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote: I am new to MLlib so I have a basic question: is it possible to save MLlib models (particularly CF models) to HDFS and then reload them later? If yes, could you share some sample code (I could not find it in the MLlib tutorial
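For the archives, a hedged sketch of what Spark 1.3+ offers for CF models: save/load on MatrixFactorizationModel. The HDFS path and toy ratings are assumptions.
    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0)))
    val model = ALS.train(ratings, 10, 5, 0.01) // rank, iterations, lambda
    model.save(sc, "hdfs:///models/cf")         // path hypothetical
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/cf")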

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
Thanks Xiangrui, I used 80 iterations to demonstrate the marginal diminishing return in prediction quality :) Justin On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote: I think before 1.3 you also get stackoverflow problem in ~35 iterations. In 1.3.x, please use

StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
. Is there any change to the ALS algorithm? And are there any ways to achieve more iterations? Thanks. Justin

Catching InvalidClassException in sc.objectFile

2015-03-19 Thread Justin Yip
in the driver program. However, both getCause() and getSuppressed() are empty. What is the recommended way of catching this exception? Thanks. Justin

Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
? Thanks, Justin

Re: Did DataFrames break basic SQLContext?

2015-03-18 Thread Justin Pihony
results in a frozen shell after this line: INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on mysql: Lexical error at line 1, column 5. Encountered: @ (64), after : . which locks the internally created metastore_db On Wed, Mar 18, 2015 at 11:20 AM, Justin Pihony justin.pih

Bug in Streaming files?

2015-03-14 Thread Justin Pihony
All, Looking into this StackOverflow question https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469 it appears that there is a bug when utilizing the newFilesOnly parameter in FileInputDStream. Before creating a ticket, I wanted to verify it here. The gist is that this

SparkSQL JSON array support

2015-03-05 Thread Justin Pihony
, but could not currently make that work either. If there isn't anything in the works, then would it be appropriate to create a ticket for this? Thanks, Justin

Re: Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
Thanks! On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust mich...@databricks.com wrote: It is somewhat out of date, but here is what we have so far: https://github.com/marmbrus/sql-typed On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com wrote: I am pretty sure that I

Spark SQL Static Analysis

2015-03-04 Thread Justin Pihony
I am pretty sure that I saw a presentation where SparkSQL could be executed with static analysis; however, I cannot find the presentation now, nor can I find any documentation or research papers on the topic. So, I am curious if there is indeed any work going on for this topic. The two things I

SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
or a bug? Thanks, Justin

Re: SQLContext.applySchema strictness

2015-02-13 Thread Justin Pihony
OK, but what about on an action, like collect()? Shouldn't it be able to determine the correctness at that time? On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai yh...@databricks.com wrote: Hi Justin, It is expected. We do not check if the provided schema matches rows since all rows need

Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
would pop and merge the combiners of each key in order, resulting in [(A, 1, ...), (A, 2, ...), (A, 3, ...), (A, 4, ...)]. Thanks in advance for the help! If there is a way to do this already in Spark 1.2, can someone point it out to me? Best, Justin

Re: Aggregate order semantics when spilling

2015-01-20 Thread Justin Uang
is trivial to runtime, and this doesn't break any backcompat. Thanks, Justin On Tue, Jan 20, 2015 at 8:03 PM, Andrew Or and...@databricks.com wrote: Hi Justin, I believe the intended semantics of groupByKey or cogroup is that the ordering *within a key *is not preserved if you spill. In fact

Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Hello, The accumulator documentation says that if the accumulator is named, it will be displayed in the WebUI. However, I cannot find it anywhere. Do I need to specify anything in the Spark UI config? Thanks. Justin

Re: Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Found it. Thanks Patrick. Justin On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell pwend...@gmail.com wrote: It should appear in the page for any stage in which accumulators are updated. On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io wrote: Hello, From

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Justin Yip
Xuelin, There is a function called emptyRDD on SparkContext http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext which serves this purpose. Justin On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote: Hi, I'd like to create
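A minimal sketch:
    // an RDD[Int] with no partitions and no elements
    val empty = sc.emptyRDD[Int]
    empty.count() // 0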

IOException: exception in uploadSinglePart

2014-11-17 Thread Justin Mills
) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Any ideas of what the underlying issue could be here? It feels like a timeout issue, but that's just a guess. Thanks, Justin

MLLib sample data format

2014-06-22 Thread Justin Yip
Hello, I am looking into a couple of MLLib data files in https://github.com/apache/spark/tree/master/data/mllib. But I cannot find any explanation for these files. Does anyone know if they are documented? Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
Hi Shuo, Yes. I was reading the guide as well as the sample code. For example, in http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm, nowhere in the GitHub repository can I find the file referenced by sc.textFile("mllib/data/ridge-data/lpsa.data"). Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
. Justin On Sun, Jun 22, 2014 at 2:40 PM, Shuo Xiang shuoxiang...@gmail.com wrote: Hi, you might find http://spark.apache.org/docs/latest/mllib-guide.html helpful. On Sun, Jun 22, 2014 at 2:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLLib data files

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
I see. That's good. Thanks. Justin On Sun, Jun 22, 2014 at 4:59 PM, Evan Sparks evan.spa...@gmail.com wrote: Oh, and the movie lens one is userid::movieid::rating - Evan On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote: Hello, I am looking into a couple of MLLib data
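Given that userid::movieid::rating layout, a hedged parsing sketch (the file path is an assumption):
    import org.apache.spark.mllib.recommendation.Rating
    // each line looks like: userid::movieid::rating
    val ratings = sc.textFile("data/mllib/sample_movielens_data.txt").map { line =>
      val Array(user, movie, rating) = line.split("::")
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }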

MLLib : Decision Tree with minimum points per node

2014-06-13 Thread Justin Yip
contain very few samples, which potentially leads to overfitting. I would like to know if there is a workaround or any way to prevent overfitting? Or will decision trees support min-samples-per-node in future releases? Thanks. Justin
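For what it's worth, later MLlib releases (1.2+) did add such a knob; a hedged sketch via Strategy, with trainingData assumed to be an RDD[LabeledPoint] prepared elsewhere:
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Gini
    val strategy = new Strategy(Algo.Classification, Gini, maxDepth = 5, numClasses = 2)
    strategy.minInstancesPerNode = 20 // don't split nodes with fewer than 20 samples
    val model = DecisionTree.train(trainingData, strategy) // trainingData assumed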

KryoException: Unable to find class

2014-06-05 Thread Justin Yip
to see the same classpaths. Is there any way to remedy this issue? Thanks. Justin