Fwd: Saving RDD as Kryo (broken in 2.1)

2017-06-26 Thread Alexander Krasheninnikov
... at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- *Alexander Krasheninnikov* Head of Data Team

Saving RDD as Kryo (broken in 2.1)

2017-06-21 Thread Alexander Krasheninnikov
... at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- *Alexander Krasheninnikov* Head of Data Team

Re: JavaDStream to Dataframe: Java

2016-06-10 Thread Alexander Krasheninnikov
Hello! While operating on a JavaDStream you may use the transform() or foreach() methods, which give you access to the underlying RDD. JavaDStream<Row> dataFrameStream = ctx.textFileStream("source").transform(new Function2<JavaRDD<String>, Time, JavaRDD<Row>>() { @Override public JavaRDD<Row> call(JavaRDD<String> incomingRdd, Time
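
The snippet above is cut off in the archive; a minimal runnable sketch of the pattern it describes, assuming Spark 1.6-era APIs (the "source" directory and the Record bean are illustrative placeholders, not from the original post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamToDataFrame {
    // Simple bean: its getters become the DataFrame columns.
    public static class Record implements java.io.Serializable {
        private String value;
        public Record() {}
        public Record(String value) { this.value = value; }
        public String getValue() { return value; }
        public void setValue(String value) { this.value = value; }
    }

    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamToDataFrame");
        JavaStreamingContext ctx = new JavaStreamingContext(conf, Durations.seconds(10));

        JavaDStream<String> lines = ctx.textFileStream("source");
        // foreachRDD exposes the RDD behind each micro-batch.
        lines.foreachRDD(rdd -> {
            SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
            JavaRDD<Record> records = rdd.map(Record::new);
            DataFrame df = sqlContext.createDataFrame(records, Record.class);
            df.show();
        });

        ctx.start();
        ctx.awaitTermination();
    }
}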

Re: Profiling a spark job

2016-04-11 Thread Alexander Krasheninnikov
If you are profiling in standalone mode, I recommend trying Java Mission Control. You just need to start the app with these params: -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=$YOUR_PORT
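
If you would rather bake those flags into the job itself, one possible approach (an assumption, not the poster's exact setup) is the standard spark.executor.extraJavaOptions key; the port value is a placeholder, and on Oracle JDK 8 Flight Recorder still requires the commercial-features flag:

import org.apache.spark.SparkConf;

public class ProfiledJobConf {
    // Executors started from this conf come up with JFR and remote JMX enabled.
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("ProfiledJob")
            .set("spark.executor.extraJavaOptions",
                 "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"
                 + " -Dcom.sun.management.jmxremote=true"
                 + " -Dcom.sun.management.jmxremote.port=9010");
    }
}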

Re: problem with a very simple word count program

2015-09-16 Thread Alexander Krasheninnikov
Collect all your RDDs from the single files into a List, then call context.union(context.emptyRDD(), YOUR_LIST). Otherwise, with a greater number of elements to union, you will get a stack overflow exception. On Wed, Sep 16, 2015 at 10:17 PM, Shawn Carroll wrote: > Your loop
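
A minimal sketch of the difference, assuming Spark 1.x/2.x Java APIs (the file pattern and count are placeholders): chaining rdd = rdd.union(next) builds one lineage node per file, while a single ctx.union call keeps the plan flat.

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class UnionManyFiles {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(
            new SparkConf().setMaster("local[4]").setAppName("UnionManyFiles"));

        // Collect the per-file RDDs into a list instead of chaining union() calls.
        List<JavaRDD<String>> rdds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rdds.add(ctx.textFile("input/part-" + i + ".txt"));
        }

        // One union over the whole list; a deep union(...).union(...) chain
        // can overflow the stack when the lineage is traversed.
        JavaRDD<String> all = ctx.union(ctx.<String>emptyRDD(), rdds);
        System.out.println(all.count());
        ctx.stop();
    }
}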

Terminate streaming app on cluster restart

2015-08-06 Thread Alexander Krasheninnikov
Hello, everyone! I have a case when running a standalone cluster: when stop-all.sh/start-all.sh are invoked on the master, the streaming app loses all its executors but does not terminate. Since it is a streaming app that is expected to deliver its results ASAP, any downtime is undesirable. Is there any workaround

Re: How to set log level in spark-submit ?

2015-07-30 Thread Alexander Krasheninnikov
I saw such an example in the docs: --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file://$path_to_file but, unfortunately, it does not work for me. On 30.07.2015 05:12, canan chen wrote: Yes, that should work. What I mean is: is there any option in the spark-submit command that I can specify

Re: Count of distinct values in each column

2015-07-29 Thread Alexander Krasheninnikov
I made such a naive implementation: SparkConf conf = new SparkConf(); conf.setMaster("local[4]").setAppName("Stub"); final JavaSparkContext ctx = new JavaSparkContext(conf); JavaRDD<String> input = ctx.textFile(path_to_file); // explode each line into list of column values JavaRDD<List<String>> rowValues =
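
Reconstructed as a complete sketch under assumptions (comma-separated input; the path and delimiter are placeholders): each line is exploded into (columnIndex, value) pairs, and the distinct pairs are counted per key.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class DistinctPerColumn {
    public static void main(String[] args) {
        JavaSparkContext ctx = new JavaSparkContext(
            new SparkConf().setMaster("local[4]").setAppName("Stub"));

        // Explode each line into (columnIndex, value) pairs.
        // Note: in Spark 1.x the flatMap function returns an Iterable;
        // in 2.x it must return an Iterator (use values.iterator()).
        JavaPairRDD<Integer, String> cells = ctx.textFile("path_to_file")
            .flatMapToPair(line -> {
                List<Tuple2<Integer, String>> values = new ArrayList<>();
                String[] columns = line.split(",");
                for (int i = 0; i < columns.length; i++) {
                    values.add(new Tuple2<>(i, columns[i]));
                }
                return values;
            });

        // Deduplicate (column, value) pairs, then count what is left per column.
        Map<Integer, Long> distinctPerColumn = cells.distinct().countByKey();
        System.out.println(distinctPerColumn);
        ctx.stop();
    }
}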

Re: Override Logging with spark-streaming

2015-06-05 Thread Alexander Krasheninnikov
Have you tried putting this file on the local disk of each executor node? That worked for me. On 05.06.2015 16:56, nib...@free.fr wrote: Hello, I want to override the log4j configuration when I start my spark job. I tried : .../bin/spark-submit --class --conf
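
One hedged illustration of that tip for executor-side logging (the path is a placeholder and must exist on every executor node), using the standard spark.executor.extraJavaOptions key:

import org.apache.spark.SparkConf;

public class LoggingConf {
    // log4j reads the file from each executor's local disk at JVM startup.
    public static SparkConf build() {
        return new SparkConf()
            .setAppName("CustomLogging")
            .set("spark.executor.extraJavaOptions",
                 "-Dlog4j.configuration=file:/opt/spark/conf/custom-log4j.properties");
    }
}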

Re: kafka + Spark Streaming with checkPointing fails to start with

2015-05-15 Thread Alexander Krasheninnikov
I had the same problem. The solution I found was to use: JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate("checkpoint_dir", contextFactory); ALL configuration should be performed inside the contextFactory. If you try to configure the streamingContext after getOrCreate, you
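
A minimal sketch of that pattern, assuming Spark 1.4+ Java APIs where getOrCreate accepts a Function0 factory (checkpoint directory, batch interval, and source are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CheckpointedApp {
    public static void main(String[] args) throws InterruptedException {
        final String checkpointDir = "checkpoint_dir";

        // The factory runs only when no checkpoint exists yet; on restart the
        // whole pipeline is rebuilt from the checkpoint data instead.
        JavaStreamingContext streamingContext = JavaStreamingContext.getOrCreate(
            checkpointDir,
            () -> {
                SparkConf conf = new SparkConf().setAppName("CheckpointedApp");
                JavaStreamingContext ctx =
                    new JavaStreamingContext(conf, Durations.seconds(10));
                ctx.checkpoint(checkpointDir);
                // ALL sources and transformations must be defined in here.
                ctx.textFileStream("source").print();
                return ctx;
            });

        // Anything configured out here is ignored after recovery from checkpoint.
        streamingContext.start();
        streamingContext.awaitTermination();
    }
}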

Streaming app with windowing and persistence

2015-04-27 Thread Alexander Krasheninnikov
Hello, everyone. I am developing a streaming application that works with window functions - each window creates a table and performs some SQL operations on the extracted data. I have met the following problem: when using window operations together with checkpointing, the application does not start the next time. Here is the code:
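
The code itself is cut off in the archive; a sketch of the window-plus-SQL shape the post describes, assuming Spark 1.6-era APIs (window sizes, table name, and query are all assumptions, not the poster's code):

import org.apache.spark.SparkConf;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class WindowedSqlApp {
    public static void main(String[] args) throws InterruptedException {
        JavaStreamingContext ctx = new JavaStreamingContext(
            new SparkConf().setAppName("WindowedSqlApp"), Durations.seconds(10));
        ctx.checkpoint("checkpoint_dir");

        // 60-second window sliding every 10 seconds.
        JavaDStream<String> windowed = ctx.textFileStream("source")
            .window(Durations.seconds(60), Durations.seconds(10));

        StructType schema = DataTypes.createStructType(new StructField[]{
            DataTypes.createStructField("line", DataTypes.StringType, false)});

        windowed.foreachRDD(rdd -> {
            SQLContext sqlContext = SQLContext.getOrCreate(rdd.context());
            // Register the window's contents as a table and run SQL on it.
            DataFrame df = sqlContext.createDataFrame(rdd.map(RowFactory::create), schema);
            df.registerTempTable("window_events");
            sqlContext.sql("SELECT COUNT(*) AS cnt FROM window_events").show();
        });

        ctx.start();
        ctx.awaitTermination();
    }
}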