Greetings!
We're reading input files with newAPIHadoopFile configured with a
multiline split. Everything's fine except for
https://issues.apache.org/jira/browse/MAPREDUCE-6549. It looks like the
issue is fixed, but only in Hadoop 2.7.2, which means we have to download
Spark without Hadoop and
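For reference, the read path looks roughly like this (a sketch: sc is our JavaSparkContext, and the input path and blank-line record delimiter are placeholders; the delimiter property is standard Hadoop configuration, and it is exactly the custom-delimiter handling in LineRecordReader that the JIRA above is about):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.spark.api.java.JavaPairRDD;

    Configuration conf = new Configuration();
    // Treat everything up to a blank line as one multi-line record.
    conf.set("textinputformat.record.delimiter", "\n\n");
    JavaPairRDD<LongWritable, Text> records = sc.newAPIHadoopFile(
        "hdfs:///data/input", TextInputFormat.class,
        LongWritable.class, Text.class, conf);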
Apurva,
I'd say you have to apply repartition just once, to the RDD that is the union
of all your files.
And it has to be done right before you do anything else.
If something in your files is not needed, then the sooner you project, the
better.
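Something like this (a sketch; the paths are placeholders):

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    JavaRDD<String> all = sc.textFile("hdfs:///in/part1")
        .union(sc.textFile("hdfs:///in/part2"));
    // One repartition of the union, before anything else touches it.
    JavaRDD<String> balanced = all.repartition(sc.defaultParallelism());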
Hope this helps.
--
Be well!
Jean Morozov
On Tue,
Marco,
I'd say yes, because it uses a different implementation of Hadoop's
InputFormat interface underneath.
What kind of proof would you like to see?
--
Be well!
Jean Morozov
On Sun, Jun 5, 2016 at 12:50 PM, Marco Capuccini <
marco.capucc...@farmbio.uu.se> wrote:
> Dear all,
>
> Does Spark use
Everett,
try to increase the thread stack size. To do that, run your application with the
following options (my app is a web application, so you might need to adjust
something): -XX:ThreadStackSize=81920
-Dspark.executor.extraJavaOptions="-XX:ThreadStackSize=81920"
The number 81920 is the stack size in KB. You could
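If you submit through spark-submit instead of embedding Spark, the same settings would map to something like this (a sketch; the class and jar names are placeholders):

    spark-submit \
      --driver-java-options "-XX:ThreadStackSize=81920" \
      --conf "spark.executor.extraJavaOptions=-XX:ThreadStackSize=81920" \
      --class com.example.MyApp myapp.jar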
Hi,
Yes, I believe people do that. I also believe that SparkML is able to
figure out when to cache some internal RDDs itself; that's definitely true for
the random forest algo. It doesn't hurt to cache the same RDD twice, either.
But it's not clear what you'd want to know...
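A tiny illustration of the double-caching point (cache() only marks the RDD; repeating it with the same storage level is a no-op):

    JavaRDD<String> data = sc.textFile("hdfs:///data/input");
    data.cache();  // marks the RDD for MEMORY_ONLY caching
    data.cache();  // harmless - same storage level, so this is a no-op
    data.count();  // the first action is what actually populates the cache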
--
Be well!
Jean Morozov
On
mentation (and
> any PLANET-like implementation)
>
> Using fewer partitions is a good idea.
>
> Which Spark version was this on?
>
> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> The questions I have in mind:
>>
increases over time.
When the warning first appeared, it was around 100 KB.
Also, the time to complete collectAsMap at DecisionTree.scala:651 has increased
from 8 seconds at the beginning of the training to 20-24 seconds now.
--
Be well!
Jean Morozov
On Wed, Mar 30, 2016 at 12:14 AM, Eugene Morozov
od idea.
>
> Which Spark version was this on?
>
> On Tue, Mar 29, 2016 at 5:21 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> The questions I have in mind:
>>
>> Is it something that one might expect? From the stack trace itself it's
>>
the right thing to do, but I've:
increased the thread stack size tenfold (to 80 MB),
reduced the default parallelism tenfold (only 20 cores are available).
Thank you in advance.
--
Be well!
Jean Morozov
On Tue, Mar 29, 2016 at 1:12 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:
>
Hi,
I have a web service that provides a REST API to train the random forest algo.
I train the random forest on a 5-node Spark cluster with enough memory -
everything is cached (~22 GB).
On small datasets of up to 100k samples everything is fine, but with the
biggest one (400k samples and ~70k features)
Could you please share your code, so that I could try it?
--
Be well!
Jean Morozov
On Sun, Mar 27, 2016 at 5:20 PM, 吴文超 wrote:
> I am a newbie to Spark. When I use IntelliJ IDEA to write some Scala code,
> I found it reports an error when using Spark's implicit
> Joseph
>
> On Mon, Dec 14, 2015 at 10:52 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
>
>> Hello!
>>
>> I'm currently working on a POC and trying to use Random Forest (classification
>> and regression). I also have to check SVM and Mul
Hi,
I have a 4-node cluster: one master (which also hosts the HDFS namenode) and
3 workers (which also host 3 colocated HDFS datanodes). Each worker has only
2 cores, and spark.executor.memory is 2.3g.
The input file is two HDFS blocks; one block is configured as 64 MB.
I train random forest regression with numTrees=50 and
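The training call is roughly the following (a sketch: only numTrees=50 comes from my setup above; the toy data and the remaining parameters are placeholders or mllib defaults):

    import java.util.Arrays;
    import java.util.HashMap;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.mllib.linalg.Vectors;
    import org.apache.spark.mllib.regression.LabeledPoint;
    import org.apache.spark.mllib.tree.RandomForest;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    JavaRDD<LabeledPoint> training = sc.parallelize(Arrays.asList(
        new LabeledPoint(1.0, Vectors.dense(0.0, 1.0)),
        new LabeledPoint(0.0, Vectors.dense(1.0, 0.0))));  // toy stand-in data
    RandomForestModel model = RandomForest.trainRegressor(
        training,
        new HashMap<Integer, Integer>(),  // no categorical features
        50,           // numTrees, as above
        "auto",       // featureSubsetStrategy
        "variance",   // impurity for regression
        10,           // maxDepth (assumption)
        32,           // maxBins (mllib default)
        12345);       // seed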
I haven't added one more HDFS node to the Hadoop cluster.
>
> Does each of the three nodes colocate with HDFS data nodes?
> The absence of a 4th data node might have something to do with the partition
> allocation.
>
> Can you show your code snippet ?
>
> Thanks
>
> On Sat,
Hi,
My cluster (standalone deployment), consisting of 3 worker nodes, was in the
middle of computations when I added one more worker node. I can see that the
new worker is registered in the master and that my job actually got one more
executor. I have configured the default parallelism as 12 and thus I see
lgorithm. There is
> no pre-emption or rescheduling of Tasks that the scheduler has already sent
> to the workers, nor is there any attempt to anticipate when already running
> Tasks will complete.
>
>
> On Sat, Feb 20, 2016 at 4:14 PM, Eugene Morozov <
> evgeny.a.moro...
Hi everyone.
I have a requirement to run prediction for a random forest model locally on a
web service, without touching Spark at all in some specific cases. I've
achieved that with the previous mllib API (Java 8 syntax):
public List<...> predictLocally(RandomForestModel
model,
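Filled out, that method might look as follows (a sketch: the return type and body are my reconstruction; the point is that RandomForestModel.predict(Vector) is a plain in-memory call, so no SparkContext is needed at prediction time):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.tree.model.RandomForestModel;

    public List<Double> predictLocally(RandomForestModel model, List<Vector> samples) {
        List<Double> predictions = new ArrayList<>(samples.size());
        for (Vector v : samples) {
            predictions.add(model.predict(v));  // local, in-memory - no Spark involved
        }
        return predictions;
    }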
Hi,
I'm trying to understand how this thing works underneath. Let's say I have
two types of jobs: highly important ones that use a small number of cores
and have to run pretty fast, and less important but greedy ones that use as
many cores as are available. So the idea is to use two corresponding pools.
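What I have in mind for the two pools is roughly this fairscheduler.xml (a sketch: the pool names, weights and minShares are made-up values; the file format is from the job-scheduling docs):

    <allocations>
      <pool name="urgent">
        <schedulingMode>FAIR</schedulingMode>
        <weight>4</weight>
        <minShare>2</minShare>
      </pool>
      <pool name="batch">
        <schedulingMode>FIFO</schedulingMode>
        <weight>1</weight>
        <minShare>0</minShare>
      </pool>
    </allocations>

It would be enabled with spark.scheduler.mode=FAIR and spark.scheduler.allocation.file pointing at the file.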
Hi,
I have several instances of the same web service that runs some ML
algos on Spark (both training and prediction) and does some Spark-unrelated
work. Each web service instance creates its own JavaSparkContext, so
they're seen as separate applications by Spark, and thus they're configured
Hello,
I'm building a simple web service that works with Spark and allows users to
train a random forest model (mllib API) and use it for prediction. Trained
models are stored on the local file system (the web service and a Spark
deployment of just one worker run on the same machine).
I'm concerned about
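In case it matters for the discussion: mllib models can also be persisted and reloaded through the built-in save/load API. A minimal sketch, with a placeholder path (sc.sc() unwraps the SparkContext from a JavaSparkContext):

    import org.apache.spark.mllib.tree.model.RandomForestModel;

    model.save(sc.sc(), "/models/forest-1");
    RandomForestModel restored = RandomForestModel.load(sc.sc(), "/models/forest-1");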
(ScalaReflection.scala:642)
~[spark-catalyst_2.10-1.6.0.jar:1.6.0]
--
Be well!
Jean Morozov
On Fri, Feb 12, 2016 at 5:57 PM, Eugene Morozov <evgeny.a.moro...@gmail.com>
wrote:
> Hello,
>
> I'm building a simple web service that works with Spark and allows users to
> train a random forest model
Emlyn,
Have you considered using pools?
http://spark.apache.org/docs/latest/job-scheduling.html#fair-scheduler-pools
I haven't tried it myself, but it looks like the pool setting is applied per
thread, which means it's possible to configure the fair scheduler so that
more than one job is on a
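The per-thread part would look roughly like this (a sketch; the pool name is a placeholder):

    // Everything submitted from this thread now goes to the given pool.
    sc.setLocalProperty("spark.scheduler.pool", "urgent");
    // ... run the high-priority jobs ...
    sc.setLocalProperty("spark.scheduler.pool", null);  // back to the default pool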
Praveen,
Zeppelin uses Spark's REPL.
I'm currently writing an app that is a web service which is going to run
Spark jobs.
So, at the init stage I just create a JavaSparkContext and then use it for
all user requests. The web service is stateless. The issue with being stateless is
that it's possible to run
,
>
> Try this:
>
> df.select("""select * from tmptable where x1 = '3.0'""").show();
>
>
> *Note: *you have to use 3 double quotes as marked
>
>
>
> On Friday, December 25, 2015 11:30 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wro
Kendal,
have you tried to reduce the number of partitions?
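For instance (a sketch; the variable and the target count are placeholders):

    // coalesce lowers the partition count without a full shuffle
    JavaRDD<String> compacted = processed.coalesce(200);
    compacted.saveAsTextFile("hdfs:///out");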
--
Be well!
Jean Morozov
On Mon, Dec 28, 2015 at 9:02 AM, kendal wrote:
> My driver is running OOM with my 4T data set... I don't collect any data to
> the driver. All the program does is map - reduce - saveAsTextFile.
>> https://github.com/apache/incubator-zeppelin/blob/01f4884a3a971ece49d668a9783d6b705cf6dbb5/spark/src/main/java/org/apache/zeppelin/spark/SparkSqlInterpreter.java#L140-L141
>>
>>
>> Also, keep in mind that you can do something like this if you want to
>> stay in DataFram
Hello, I'm basically stuck, as I have no idea where to look.
The following simple code, given that my data source is working, gives me an
exception.
DataFrame df = sqlc.load(filename, "com.epam.parso.spark.ds.DefaultSource");
df.cache();
df.printSchema(); <-- prints the schema perfectly fine!
from SQL query.
> I searched unit tests but didn't find any in the form of df.select("select
> ...")
>
> Looks like you should use sqlContext as other people suggested.
>
> On Fri, Dec 25, 2015 at 8:29 AM, Eugene Morozov <
> evgeny.a.moro...@gmail.com> wrote:
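For reference, the sqlContext route suggested here looks like this (a sketch, assuming the DataFrame is registered as the temp table tmptable from the original question):

    df.registerTempTable("tmptable");
    DataFrame result = sqlContext.sql("select * from tmptable where x1 = '3.0'");
    result.show();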
ail.com>
> wrote:
>
>> hello
>> you can try to use df.limit(5).show()
>> just a trick :)
>>
>> On Fri, Dec 25, 2015 at 2:34 PM, Eugene Morozov <
>> evgeny.a.moro...@gmail.com> wrote:
>>
>>> Hello, I'm basically stuck as I have no idea w
Hi!
I'm looking for a way to run prediction for a learned model in the most
performant way. It might happen that some users want to predict just a
couple of samples (literally one or two), while others would run prediction
for tens of thousands. It's no surprise that there is an overhead
to
Hello!
I'm currently working on a POC and trying to use Random Forest (classification
and regression). I also have to check SVM and Multiclass perceptron (other
algos are less important at the moment). So far I've discovered that Random
Forest has a limitation of maxDepth for trees, and just out of
Hello,
I'm using the RandomForest pipeline (ml package). Everything is working fine
(learning models, prediction, etc.), but I'd like to tune it for the case
when I predict with a small dataset.
My issue is that when I apply
(PipelineModel)model.transform(dataset)
The model consists of the following
ithTheOriginalLabels)
> .setLabels(labelIndexer.labels)
>
> val pipeline = new Pipeline()
> .setStages(Array(labelIndexer, randomForest, labelConverter))
>
> Hoping that helps,
> Ben.
>
> On Sat, Dec 5, 2015 at 12:26 PM, Eugene Morozov <
> evgeny.a.moro
create your own map and reverse map of (label to index) and
> (index to label) and use this for getting back your original label.
>
> Maybe there is a better way to do this...
>
> Regards,
> Vishnu
>
> On Fri, Dec 4, 2015 at 4:56 PM, Eugene Morozov <evgeny.a.moro...@gmail.c
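A sketch of Vishnu's map / reverse-map suggestion (the label values here are made up):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    List<String> labels = Arrays.asList("cat", "dog", "bird");  // hypothetical labels
    Map<String, Integer> labelToIndex = new HashMap<>();
    Map<Integer, String> indexToLabel = new HashMap<>();
    for (int i = 0; i < labels.size(); i++) {
        labelToIndex.put(labels.get(i), i);
        indexToLabel.put(i, labels.get(i));
    }
    double prediction = 1.0;  // e.g. a model's numeric output
    String original = indexToLabel.get((int) prediction);  // "dog"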
Hello,
I've got an input dataset of handwritten digits and working Java code that
uses the random forest classification algorithm to determine the numbers. My
test set is just some lines from the same input dataset - just to be sure
I'm doing the right thing. My understanding is that having correct
Hi,
I have a DataFrame with several columns I'd like to explode. All of the
columns I have to explode have an ArrayBuffer type with some other types
inside.
I'd say that the following code is totally legit to use as an explode
function for any given ArrayBuffer - my assumption is that for any given
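For comparison, since Spark 1.4 there is also a built-in explode in org.apache.spark.sql.functions for array-typed columns (a sketch; df and the column names are placeholders):

    import static org.apache.spark.sql.functions.explode;

    // one output row per element of the array column
    DataFrame exploded = df.withColumn("item", explode(df.col("arrayCol")));

Note that chaining it over several array columns yields the cross product of their elements.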
.
--
Be well!
Jean Morozov
On Tue, Oct 6, 2015 at 1:58 AM, Davies Liu <dav...@databricks.com> wrote:
> Could you tell us a way to reproduce this failure? Reading from JSON or
> Parquet?
>
> On Mon, Oct 5, 2015 at 4:28 AM, Eugene Morozov
> <evgeny.a.moro...@gmail.com> w
Hi,
We're building our own framework on top of Spark, and we give users a pretty
complex schema to work with. That requires us to build dataframes by
ourselves: we transform business objects to rows and struct types and use
these two to create the dataframe.
Everything was fine until I started to
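In its minimal form that construction looks like this (a sketch: the schema and values are placeholders; sc is a JavaSparkContext, sqlContext an SQLContext):

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    StructType schema = DataTypes.createStructType(new StructField[]{
        DataTypes.createStructField("id", DataTypes.LongType, false),
        DataTypes.createStructField("name", DataTypes.StringType, true)});
    JavaRDD<Row> rows = sc.parallelize(Arrays.asList(
        RowFactory.create(1L, "first"),
        RowFactory.create(2L, "second")));
    DataFrame df = sqlContext.createDataFrame(rows, schema);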
Hi,
I'm using Spark 1.3.1 built against Hadoop 1.0.4 and Java 1.7, and I'm
trying to save my data frame to parquet.
The issue I'm stuck on looks like serialization is trying to do a pretty weird
thing: it tries to write to an empty array.
The last (through the stack trace) line of Spark code that leads to
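For context, the call on 1.3.x is saveAsParquetFile (df.write() only appeared in 1.4); a minimal sketch with a placeholder path:

    df.saveAsParquetFile("hdfs:///out/frame.parquet");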
(nullable = true)
 |    |-- e: string (nullable = true)
help me.
Regards,
Rishabh.
Eugene Morozov
fathers...@list.ru
, stored partitions have to be deleted somehow. How does
that happen?
--
Eugene Morozov
fathers...@list.ru
Eugene Morozov
fathers...@list.ru
java.nio.channels.SocketChannel
Probably it's hitting a race condition.
Has anyone else faced this situation? Any suggestions?
Thanks a lot!
On 15 July 2015 at 14:04, Eugene Morozov fathers...@list.ru wrote:
Yiannis,
It looks like you might explore another approach.
sc.textFile
?
Hopefully I have described the issue clearly. Please
feel free to correct me if I have done something wrong. Thanks a lot.
Eugene Morozov
fathers...@list.ru
Hi!
I’d like to complete an action (store / print something) inside of a transformation (map
or mapPartitions). This approach has some flaws, but there is a question: might
it happen that Spark will optimise (RDD or DataFrame) processing so that my
mapPartitions simply won’t happen?
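To make the concern concrete, a minimal sketch (rdd is a placeholder JavaRDD<String>): transformations are lazy, so the side effect below runs only if some action pulls data through the partition.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    JavaRDD<String> logged = rdd.mapPartitions(it -> {
        List<String> buffered = new ArrayList<>();
        it.forEachRemaining(buffered::add);
        // The side effect in question - it runs only when this partition is computed.
        System.out.println("saw a partition of " + buffered.size() + " rows");
        return buffered;  // the 1.x Java API expects an Iterable here
    });
    logged.count();  // without an action like this, the println never runs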
--
Eugene Morozov
by a key that I'm already partitioned by?
- Philip
Eugene Morozov
fathers...@list.ru
constructor for the class C and deserialization is broken with
an invalid constructor exception.
I think it's a common use case. Any help is appreciated.
--
Hao Ren
Data Engineer @ leboncoin
Paris, France
Eugene Morozov
fathers...@list.ru
the in-between data values.
Regards,
Deepesh
Eugene Morozov
fathers...@list.ru
--
Sujee Maniyam (http://sujee.net | http://www.linkedin.com/in/sujeemaniyam )
Eugene Morozov
fathers...@list.ru
, and a second time in the properties
file, which looks weird, and it's unclear to me why I should do that.
What is the reason for it? I thought the jar file has to be copied to all
Worker nodes (or else it’s not possible to run the job on Workers). Can anyone
shed some light on this?
Thanks
--
Eugene
might explain why KryoRegistrator is not being found on the Worker -
there are no functions which use it directly, so it is never copied to Workers.
Could you please explain how code ends up on a Worker, or give me a hint
where I can find it in the sources?
On 08 Jul 2015, at 17:40, Eugene Morozov
Spark does reshuffle. Why does it do so?
Thanks in advance.
--
Eugene Morozov
fathers...@list.ru
)
at
com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:67)
Eugene Morozov
fathers...@list.ru
. And unfortunately it’s not possible to cast this column,
as casting string to struct is not allowed.
Are there any workarounds to get the correct schema?
Thanks in advance.
Eugene Morozov
fathers...@list.ru
be
implementation of DataFrame itself provides some sort of custom types or something
pluggable that I might consider.
Any clue would be really appreciated.
Thanks
--
Eugene Morozov
fathers...@list.ru