scale stream processor tomorrow.
CCing: Cody Koeninger
Best,
Justin
> On Jan 23, 2018, at 10:48 PM, namesuperwood <namesuperw...@gmail.com> wrote:
>
> Hi all
>
> kafka version : kafka_2.11-0.11.0.2
> spark version : 2.0.1
>
> A topic-pa
org/jira/browse/SPARK-17147>
We are compacting topics, but only offset topics. We just updated our message
version to 0.10 today as our last non-Spark project was brought up to 0.11
(Storm based).
Justin
> On Jan 18, 2018, at 1:39 PM, Cody Koeninger <c...@koeninger.org> wr
!
Justin
On Wednesday, January 17, 2018, Cody Koeninger <c...@koeninger.org> wrote:
> That means the consumer on the executor tried to seek to the specified
> offset, but the message that was returned did not have a matching
> offset. If the executor can't get the messages
ter that’s crashing).
Thank you!
Justin
etting the following for all:
Caused by: MetaException(message:java.io.IOException: Got exception:
java.io.IOException
/username/sys_encrypted/staging/raw/updatedTimeYear=2017/updatedTimeMonth=5/updatedTimeDay=16/updatedTimeHour=23
doesn't exist)
Thanks!
Justin
I can't seem to find anywhere that would let a user know if the receiver they
are using is reliable or not. Even better would be a list of known reliable
receivers. Are any of these things possible? Or do you just have to research
your receiver beforehand?
All,
Before creating a JIRA for this I wanted to get a sense as to whether it
would be shot down or not:
Take the following code:
spark-shell --packages org.apache.avro:avro:1.8.1
import org.apache.avro.{Conversions, LogicalTypes, Schema}
import java.math.BigDecimal
val dc = new
compute.internal (executor 53)
(96/96)
Thanks,
Justin
nything
explicit
-Justin Pihony
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkStreaming-getActiveOrCreate-tp28508.html
I've created a ticket here: https://issues.apache.org/jira/browse/SPARK-19888
Thanks,
Justin
> On Mar 10, 2017, at 1:14 PM, Michael Armbrust <mich...@databricks.com> wrote:
>
> If you have a reproduction you should o
Hi Michael,
I'm experiencing a similar issue. Will this not be fixed in Spark Streaming?
Best,
Justin
> On Mar 10, 2017, at 8:34 AM, Michael Armbrust <mich...@databricks.com> wrote:
>
> One option here would be to try Structured Streaming. We've added an option
>
I've verified this is that issue, so please disregard.
On Wed, Mar 1, 2017 at 1:07 AM, Justin Pihony <justin.pih...@gmail.com>
wrote:
> As soon as I posted this I found https://issues.apache.
> org/jira/browse/SPARK-18648 which seems to be the issue. I'm looking at
> it deeper
As soon as I posted this I found
https://issues.apache.org/jira/browse/SPARK-18648 which seems to be the
issue. I'm looking at it deeper now.
On Wed, Mar 1, 2017 at 1:05 AM, Justin Pihony <justin.pih...@gmail.com>
wrote:
> Run spark-shell --packages
> datastax:spark-cassandra-connect
Run spark-shell --packages
datastax:spark-cassandra-connector:2.0.0-RC1-s_2.11 and then try to do an
import of anything com.datastax. I have checked that the jar is listed among
the classpaths and it is, albeit behind a spark URL. I'm wondering if added
jars fail in windows due to this server
't find anything in JIRA.
Thanks,
Justin Pihony
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-list-of-missing-optimizations-for-typed-functions-tp28418.html
> attached to all of the code that is currently intended for the associated
> release number.
>
> On Wed, Dec 28, 2016 at 3:09 PM, Justin Miller <justin.mil...@protectwise.com
> <mailto:justin.mil...@protectwise.com>> wrote:
> It looks like the jars for 2.1.0-SNAP
ec 28 20:01:10 UTC 2016
2.2.0-SNAPSHOT/
<https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-sql_2.11/2.2.0-SNAPSHOT/>
Wed Dec 28 19:12:38 UTC 2016
What's with 2.1.1-SNAPSHOT? Is that version about to be released as well?
Thanks!
Justin
>
I'm curious about this as well. Seems like the vote passed.
> On Dec 23, 2016, at 2:00 AM, Aseem Bansal wrote:
>
>
(Cross post with
http://stackoverflow.com/questions/32936380/k-means-clustering-is-biased-to-one-center)
I have a corpus of wiki pages (baseball, hockey, music, football) which I'm
running through tfidf and then through kmeans. After a couple issues to
start (you can see my previous questions),
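For reference, a minimal sketch of the tf-idf-into-k-means flow in the 1.x MLlib API; the directory, k, and iteration count below are illustrative, not the original setup:
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.feature.{HashingTF, IDF}
  // Tokenize each wiki page into a bag of terms (one RDD element per document).
  val docs = sc.wholeTextFiles("wiki_pages/").map { case (_, text) =>
    text.toLowerCase.split("\\W+").toSeq
  }
  // Term frequencies, then inverse document frequencies.
  val tf = new HashingTF().transform(docs)
  tf.cache()
  val tfidf = new IDF().fit(tf).transform(tf)
  // Cluster into k = 4 groups (baseball, hockey, music, football), 20 iterations.
  val model = KMeans.train(tfidf, 4, 20)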
Good morning,
I have a typical iterator loop on a DataFrame loaded from a parquet data
source:
val conf = new SparkConf().setAppName("Simple Application").setMaster("local")
val sc = new JavaSparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val parquetDataFrame =
To take a stab at my own answer: MLBase is now fully integrated into MLLib.
MLI/MLLib are the mllib algorithms and MLO is the ml pipelines?
On Mon, Sep 28, 2015 at 10:19 PM, Justin Pihony <justin.pih...@gmail.com>
wrote:
> As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.sp
As in, is MLBase (MLO/MLI/MLlib) now simply org.apache.spark.mllib and
org.apache.spark.ml? I cannot find anything official, and the last updates
seem to be a year or two old.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Is-MLBase-dead-tp24854.html
to:
http://ec2_publicdns:20888/proxy/applicationid/jobs (9046 is the older
emr port)
or, as Jonathan said, the spark history server works once a job is
completed.
On Tue, Aug 25, 2015 at 5:26 PM, Justin Pihony justin.pih...@gmail.com
wrote:
OK, I figured the horrid look also... the href
18080. Hope that helps!
~ Jonathan
On 8/24/15, 10:51 PM, Justin Pihony justin.pih...@gmail.com wrote:
I am using the steps from this article
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to
get spark up and running on EMR through yarn. Once up and running I ssh
OK, I figured the horrid look also... the href of all of the styles is
prefixed with the proxy data... so, ultimately if I can fix the proxy
issues with the links, then I can fix the look also
On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony justin.pih...@gmail.com
wrote:
SUCCESS! I set
as the UI looks horrid... but I'll tackle that next :)
On Tue, Aug 25, 2015 at 4:31 PM, Justin Pihony justin.pih...@gmail.com
wrote:
Thanks. I just tried and still am having trouble. It seems to still be
using the private address even if I try going through the resource manager.
On Tue, Aug 25, 2015
Additional info...If I use an online md5sum check then it matches...So,
it's either windows or python (using 2.7.10)
On Mon, Aug 24, 2015 at 11:54 AM, Justin Pihony justin.pih...@gmail.com
wrote:
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
seen this on two
When running the spark_ec2.py script, I'm getting a wrong md5sum. I've now
seen this on two different machines. I am running on windows, but I would
imagine that shouldn't affect the md5. Is this a boto problem, python
problem, spark problem?
I am using the steps from this article
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923 to
get spark up and running on EMR through yarn. Once up and running I ssh in
and cd to the spark bin and run spark-shell --master yarn. Once this spins
up I can see that the UI is started
We are aggregating real time logs of events, and want to do windows of 30
minutes. However, since the computation doesn't start until 30 minutes have
passed, there is a ton of data built up that processing could've already
started on. When it comes time to actually process the data, there is too
of processing overhead -- I couldn't figure out exactly why but it
seemed to have something to do with forEachRDD only being executed on the
driver.
On Thu, Aug 20, 2015 at 1:39 PM, Iulian Dragoș iulian.dra...@typesafe.com
wrote:
On Thu, Aug 20, 2015 at 6:58 PM, Justin Grimes jgri...@adzerk.com wrote:
We
I have a spark job that's running on a 10 node cluster and the python
process on all the nodes is pegged at 100%.
I was wondering what parts of a spark script are run in the python process
and which get passed to the Java processes? Is there any documentation on
this?
Thanks,
Justin
Hello,
Currently, there is no NaiveBayes implementation for MLpipeline. I couldn't
find the JIRA ticket related to it too (or maybe I missed).
Is there a plan to implement it? If no one has the bandwidth, I can work on
it.
Thanks.
Justin
Done.
https://issues.apache.org/jira/browse/SPARK-8420
Justin
On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng men...@gmail.com wrote:
That sounds like a bug. Could you create a JIRA and ping Yin Huai
(cc'ed). -Xiangrui
On Wed, May 27, 2015 at 12:57 AM, Justin Yip yipjus...@prediction.io
$lzycompute(random.scala:39)
at org.apache.spark.sql.catalyst.expressions.RDG.rng(random.scala:39)
..
Does any one know why?
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-with-functions-rand-tp23267.html
feature generation transformer like StringIndexerModel
cannot be used in Pipeline.
Maybe due to my limited knowledge in ML pipeline, can anyone give me some
hints why Model.copy behaves differently as other Params?
Thanks!
Justin
Hi all,
I'm running Spark on AWS EMR and I'm having some issues getting the correct
permissions on the output files using
rdd.saveAsTextFile('file_dir_name'). In hive, I would add a line in the
beginning of the script with
set fs.s3.canned.acl=BucketOwnerFullControl
and that would set the
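For comparison, the closest Spark-side equivalent of that Hive line is setting the same key on the SparkContext's Hadoop configuration before writing. This is only a sketch (the bucket path is hypothetical, and whether EMR's S3 filesystem honors the key when set from Spark is exactly the open question):
  // Propagate the canned-ACL setting through the Hadoop configuration that saveAsTextFile uses.
  sc.hadoopConfiguration.set("fs.s3.canned.acl", "BucketOwnerFullControl")
  rdd.saveAsTextFile("s3://bucket/file_dir_name")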
Hi all,
I'm playing around with manipulating images via Python and want to
utilize Spark for scalability. That said, I'm just learing Spark and my
Python is a bit rusty (been doing PHP coding for the last few years). I
think I have most of the process figured out. However, the script fails
: org.apache.spark.sql.DataFrame = [_1: int, _2: timestamp]
scala> df.filter($"_2" <= "2014-06-01").show
+--+--+
|_1|_2|
+--+--+
+--+--+
Not sure if that is intended, but I cannot find any doc mentioning these
inconsistencies.
Thanks.
Justin
You need to make sure to name the accumulator.
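A minimal sketch of what that looks like in the 1.x API, assuming `stream` is the Kafka DStream from the original application (the name and counting logic here are illustrative); it is the overload that takes a name that makes the value show up on the stage pages of the web UI:
  // Named accumulator: the second argument is what the web UI displays.
  val eventCount = sc.accumulator(0L, "Kafka events received")
  stream.foreachRDD { rdd =>
    rdd.foreach(_ => eventCount += 1L)
  }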
On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote nagmote.sne...@gmail.com
wrote:
Hello all,
I have accumulator in spark streaming application which counts number of
events received from Kafka.
From the documentation , It seems Spark UI has
/docs and used the
following command:
$ build/sbt unidoc
Please see attachment for detailed error. Did I miss anything?
Thanks.
Justin
unidoc_error.txt (30K)
http://apache-spark-user-list.1001560.n3.nabble.com/attachment/23044/0/unidoc_error.txt
for K and V separately.
On Fri, May 22, 2015 at 10:26 AM, Justin Pihony justin.pih...@gmail.com
wrote:
This ticket https://issues.apache.org/jira/browse/SPARK-4397 improved
the RDD API, but it could be even more discoverable if made available via
the API directly. I assume
as the implicits remain, then compatibility remains, but now it is
explicit in the docs on how to get a PairRDD and in tab completion.
Thoughts?
Justin Pihony
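For context, a small sketch of the implicit-based path being discussed: pair operations like reduceByKey only appear on an RDD[(K, V)] through the implicit conversion to PairRDDFunctions (the file path below is hypothetical):
  import org.apache.spark.SparkContext._   // brings in rddToPairRDDFunctions (pre-1.3 style)
  val counts = sc.textFile("input.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))
    .reduceByKey(_ + _)                    // supplied by PairRDDFunctions via the implicit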
Xiangrui, is there a timeline for when UDTs will become a public API? I'm
currently using them to support java 8's ZonedDateTime.
On Tue, May 19, 2015 at 3:14 PM Xiangrui Meng men...@gmail.com wrote:
(Note that UDT is not a public API yet.)
On Thu, May 7, 2015 at 7:11 AM, wjur
Thanks!
On Wed, May 20, 2015 at 12:41 AM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Check out Apache's trademark guidelines here:
http://www.apache.org/foundation/marks/
Matei
On May 20, 2015, at 12:02 AM, Justin Pihony justin.pih...@gmail.com
wrote:
What is the license on using
What is the license on using the spark logo. Is it free to be used for
displaying commercially?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-logo-license-tp22952.html
When running something like this:
spark-shell --jars foo.jar,bar.jar
This keeps failing to include the tail of the jars list. Digging into the
launch scripts I found that the comma makes it so that the list was sent as
separate parameters. So, to keep things together, I tried
-media-support-3.0.3.jar to
C:\Users\Justin\AppData\Local\Temp\spark-4a37d3
e9-34a2-40d4-b09b-6399931f527d\userFiles-65ee748e-4721-4e16-9fe6-65933651fec1\fetchFileTemp8970201232303518432.tmp
15/05/18 22:03:14 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
I think I found the answer -
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html
Do I have no way of running this in Windows locally?
On Mon, May 18, 2015 at 10:44 PM, Justin Pihony justin.pih...@gmail.com
wrote
I'm not 100% sure that is causing a problem, though. The stream still
starts, but is giving blank output. I checked the environment variables in
the ui and it is running local[*], so there should be no bottleneck there.
On Mon, May 18, 2015 at 10:08 PM, Justin Pihony justin.pih...@gmail.com
wrote
if the default is not used and
you provide a partition size that is larger than the block size. If my
research is right and the getSplits call simply ignores this parameter, then
wouldn't the provided min end up being ignored and you would still just get
the block size?
Thanks,
Justin
Hello,
I am using MLPipeline. I would like to extract the best parameter found by
CrossValidator. But I cannot find much document about how to do it. Can
anyone give me some pointers?
Thanks.
Justin
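A minimal pointer, assuming a CrossValidator `cv` and a training DataFrame `training` already exist (both names are placeholders): the fitted CrossValidatorModel keeps the winning model directly, and the selected parameter values can be read off it.
  // The model chosen by the parameter search; for a Pipeline estimator this is a
  // PipelineModel whose stages carry the selected parameter values.
  val cvModel = cv.fit(training)
  val best = cvModel.bestModel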
: int, _2: int]
scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-to-avoid-ambiguous-columns-in-DataFrame-join-tp22907.html
time of each user, and join
it back to the original data frame. But that involves two shuffles. Hence
would like to see if there are ways to improve the performance.
Thanks.
Justin
On Fri, May 15, 2015 at 6:32 AM, ayan guha guha.a...@gmail.com wrote:
can you kindly elaborate on this? it should
)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:161)
...
Thanks!
Justin
On Fri, May 15, 2015 at 3:55 PM, Michael Armbrust mich...@databricks.com
wrote:
There are several ways to solve this ambiguity:
*1. use the DataFrames to get the attribute so its already resolved and
not just a string we need
Hello,
May I know if these is way to implement aggregate function for grouped data
in DataFrame? I dug into the doc but didn't find any apart from the UDF
functions which applies on a Row. Maybe I have missed something. Thanks.
Justin
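For what it's worth, a sketch of the built-in aggregations that do exist on grouped data in 1.3+ (the DataFrame and column names are illustrative); arbitrary user-defined aggregation functions over a whole group are a separate question:
  import org.apache.spark.sql.functions._
  // Built-in aggregates applied per group; each call returns a Column expression.
  df.groupBy("userId").agg(sum("amount"), avg("amount"), max("eventTime"))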
missed something crucial. Can anyone give me some
pointers?
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Creating-StructType-with-DataFrame-withColumn-tp22715.html
After some trial and error, using DataType solves the problem:
df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000)
Justin
On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip yipjus...@prediction.io wrote:
Hello,
I was able to cast a timestamp into long using
to this failure?
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/casting-timestamp-into-long-fail-in-Spark-1-3-1-tp22727.html
the relevance of programmatic marketing
Thanks!
Justin
Justin Barton
CTO
+1 (718) 404 9272
+44 203 290 9272
atp.io | jus...@atp.io | find us https://atp.io/find-us
(nullable = false)
|-- SUM(_1#179): long (nullable = true)
|-- SUM(_2#180): long (nullable = true)
Thanks.
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Column-renaming-after-DataFrame-groupBy-tp22586.html
That explains it. Thanks Reynold.
Justin
On Mon, Apr 13, 2015 at 11:26 PM, Reynold Xin r...@databricks.com wrote:
I think what happened was applying the narrowest possible type. Type
widening is required, and as a result, the narrowest type is string between
a string and an int.
https
$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply$mcII$sp(console:51)
...
CAUSE: null
The exception cause is a null value. Is there any way that I can catch the
ArithmeticException?
Thanks
Justin
: string (nullable = true)
The unionAll documentation says it behaves like the SQL UNION ALL function.
However, unioning incompatible types is not well defined for SQL. Is there
any expected behavior for unioning incompatible data frames?
Thanks.
Justin
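A tiny illustration of the observed behavior, consistent with the type-widening explanation elsewhere in this listing (column name illustrative, sqlContext as provided by the shell): the common type chosen for an Int column and a String column is String.
  // Two single-column frames with conflicting types for the same column name.
  val ints    = sqlContext.createDataFrame(Seq(Tuple1(1), Tuple1(2))).toDF("x")     // x: int
  val strings = sqlContext.createDataFrame(Seq(Tuple1("a"), Tuple1("b"))).toDF("x") // x: string
  ints.unionAll(strings).printSchema()   // x ends up as string after widening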
, but if they are available as UDT and provided by the SparkSQL
library, that will make DataFrame users' life easier.
Justin
On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote:
One possible approach can be defining a UDT (user-defined type) for Joda
time. A UDT maps an arbitrary
Hello,
The DataFrame documentation always uses $"columnX" to annotate a column.
But I cannot find much information about it. Maybe I have missed something.
Can anyone point me to the doc about the $, if there is any?
Thanks.
Justin
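For anyone else searching: the $ prefix is the string-to-column interpolator provided by the SQLContext implicits. A sketch of its use, with an illustrative DataFrame and column name:
  import sqlContext.implicits._   // provides the $"..." syntax (StringToColumn)
  // These select the same column; $"columnX" builds a Column from the name,
  // while df("columnX") resolves it against this particular DataFrame.
  df.select($"columnX")
  df.select(df("columnX"))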
: cannot resolve 'b.b' given input
columns a, b.b;
at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
Thanks.
Justin
Thanks Michael. Will submit a ticket.
Justin
On Mon, Apr 6, 2015 at 1:53 PM, Michael Armbrust mich...@databricks.com
wrote:
I'll add that I don't think there is a convenient way to do this in the
Column API ATM, but would welcome a JIRA for adding it :)
On Mon, Apr 6, 2015 at 1:45 PM
is that DataFrame cannot leverage its columnar format after
persisting in memory. But cannot find anywhere from the doc mentioning this.
Did I miss anything?
Thanks!
Justin
The schema has a StructType.
Justin
On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai yh...@databricks.com wrote:
Hi Justin,
Does the schema of your data have any decimal, array, map, or struct type?
Thanks,
Yin
On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip yipjus...@prediction.io
wrote:
Hello
Thanks for the explanation Yin.
Justin
On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai yh...@databricks.com wrote:
I think the slowness is caused by the way that we serialize/deserialize
the value of a complex type. I have opened
https://issues.apache.org/jira/browse/SPARK-6759 to track
such operation?
Can't find much info about operating MapType on Column in the doc.
Thanks ahead!
Justin
.
Justin
On Fri, Apr 3, 2015 at 9:16 AM, S. Zhou myx...@yahoo.com.invalid wrote:
I am new to MLib so I have a basic question: is it possible to save MLlib
models (particularly CF models) to HDFS and then reload it later? If yes,
could u share some sample code (I could not find it in MLlib tutorial
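Since the quoted question asks for sample code, here is a minimal sketch of the save/load support for ALS models in later 1.x releases, assuming `model` is the MatrixFactorizationModel returned by ALS.train and using a hypothetical HDFS path:
  import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
  // Persist a trained CF model to a Hadoop-compatible filesystem and read it back.
  model.save(sc, "hdfs:///models/als")
  val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als")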
Thanks Xiangrui,
I used 80 iterations to demonstrates the marginal diminishing return in
prediction quality :)
Justin
On Apr 2, 2015 00:16, Xiangrui Meng men...@gmail.com wrote:
I think before 1.3 you also get stackoverflow problem in ~35
iterations. In 1.3.x, please use
.
Is there any change to the ALS algorithm? And are there any ways to achieve
more iterations?
Thanks.
Justin
in the driver program.
However, both getCause() and getSuppressed() are empty.
What is the recommended way of catching this exception?
Thanks.
Justin
?
Thanks,
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Did-DataFrames-break-basic-SQLContext-tp22120.html
results in a frozen shell after this line:
INFO MetaStoreDirectSql: MySQL check failed, assuming we are not on
mysql: Lexical error at line 1, column 5. Encountered: @ (64), after :
.
which, locks the internally created metastore_db
On Wed, Mar 18, 2015 at 11:20 AM, Justin Pihony justin.pih
All,
Looking into this StackOverflow question
https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469
it appears that there is a bug when utilizing the newFilesOnly parameter in
FileInputDStream. Before creating a ticket, I wanted to verify it here. The
gist is that this
, but could not currently make that work either.
If there isn't anything in the works, then would it be appropriate to create
a ticket for this?
Thanks,
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-JSON-array-support-tp21939.html
Thanks!
On Wed, Mar 4, 2015 at 3:58 PM, Michael Armbrust mich...@databricks.com
wrote:
It is somewhat out of data, but here is what we have so far:
https://github.com/marmbrus/sql-typed
On Wed, Mar 4, 2015 at 12:53 PM, Justin Pihony justin.pih...@gmail.com
wrote:
I am pretty sure that I
I am pretty sure that I saw a presentation where SparkSQL could be executed
with static analysis, however I cannot find the presentation now, nor can I
find any documentation or research papers on the topic. So, I am curious if
there is indeed any work going on for this topic. The two things I
or a bug?
Thanks,
Justin
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/SQLContext-applySchema-strictness-tp21650.html
OK, but what about on an action, like collect()? Shouldn't it be able to
determine the correctness at that time?
On Fri, Feb 13, 2015 at 4:49 PM, Yin Huai yh...@databricks.com wrote:
Hi Justin,
It is expected. We do not check if the provided schema matches rows since
all rows need
would pop and merge the combiners of each key in order, resulting in
[(A, 1, ...), (A, 2, ...), (A, 3, ...), (A, 4, ...)].
Thanks in advance for the help! If there is a way to do this already in
Spark 1.2, can someone point it out to me?
Best,
Justin
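One approach that does exist in 1.2 is the usual secondary-sort pattern: fold the sort field into the key, partition only on the real key, and let repartitionAndSortWithinPartitions order the records inside each partition. A sketch with illustrative data and names:
  import org.apache.spark.{HashPartitioner, Partitioner}
  import org.apache.spark.SparkContext._   // OrderedRDDFunctions implicits in 1.2
  // Partition on the real key only, so all records for "A" land together even
  // though the sort field is part of the composite key.
  class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
    private val delegate = new HashPartitioner(partitions)
    override def numPartitions: Int = partitions
    override def getPartition(key: Any): Int = {
      val (realKey, _) = key.asInstanceOf[(String, Int)]
      delegate.getPartition(realKey)
    }
  }
  val records = sc.parallelize(Seq(("A", 3), ("B", 2), ("A", 1), ("A", 4), ("A", 2)))
  val ordered = records
    .map { case (k, v) => ((k, v), v) }   // composite key carries the sort field
    .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(4))
  // Within each partition, the "A" records now come out as 1, 2, 3, 4.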
is trivial to runtime, and this
doesn't break any backcompat.
Thanks,
Justin
On Tue, Jan 20, 2015 at 8:03 PM, Andrew Or and...@databricks.com wrote:
Hi Justin,
I believe the intended semantics of groupByKey or cogroup is that the
ordering *within a key *is not preserved if you spill. In fact
Hello,
From accumulator documentation, it says that if the accumulator is named,
it will be displayed in the WebUI. However, I cannot find it anywhere.
Do I need to specify anything in the spark ui config?
Thanks.
Justin
Found it. Thanks Patrick.
Justin
On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell pwend...@gmail.com
wrote:
It should appear in the page for any stage in which accumulators are
updated.
On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip yipjus...@prediction.io
wrote:
Hello,
From
Xuelin,
There is a function called emptyRDD on the SparkContext
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
which
serves this purpose.
Justin
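A one-line illustration (the element type is chosen arbitrarily):
  // An empty RDD of Ints with no partitions, created straight from the SparkContext.
  val empty = sc.emptyRDD[Int]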
On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao xuelincao2...@gmail.com wrote:
Hi,
I'd like to create
)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Any ideas of what the underlying issue could be here? It feels like a
timeout issue, but that's just a guess.
Thanks, Justin
Hello,
I am looking into a couple of MLLib data files in
https://github.com/apache/spark/tree/master/data/mllib. But I cannot find
any explanation for these files? Does anyone know if they are documented?
Thanks.
Justin
Hi Shuo,
Yes. I was reading the guide as well as the sample code.
For example, in
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm,
nowhere in the GitHub repository can I find the file referenced by
sc.textFile("mllib/data/ridge-data/lpsa.data").
Thanks.
Justin
.
Justin
On Sun, Jun 22, 2014 at 2:40 PM, Shuo Xiang shuoxiang...@gmail.com wrote:
Hi, you might find http://spark.apache.org/docs/latest/mllib-guide.html
helpful.
On Sun, Jun 22, 2014 at 2:35 PM, Justin Yip yipjus...@gmail.com wrote:
Hello,
I am looking into a couple of MLLib data files
I see. That's good. Thanks.
Justin
On Sun, Jun 22, 2014 at 4:59 PM, Evan Sparks evan.spa...@gmail.com wrote:
Oh, and the movie lens one is userid::movieid::rating
- Evan
On Jun 22, 2014, at 3:35 PM, Justin Yip yipjus...@gmail.com wrote:
Hello,
I am looking into a couple of MLLib data
contain very few samples, which
potentially leads to overfitting.
I would like to know if there is workaround or any way to prevent
overfitting? Or will decision tree supports min-samples-per-node in future
releases?
Thanks.
Justin
to
see the same classpaths.
Is there any way to remedy this issue?
Thanks.
Justin