Unsubscribe

2023-12-12 Thread Sergey Boytsov
Unsubscribe --

unsubscribe

2023-12-11 Thread Sergey Boytsov
-- Sergei Boitsov JetBrains GmbH Christoph-Rapparini-Bogen 23 80639 München Handelsregister: Amtsgericht München, HRB 187151 Geschäftsführer: Yury Belyaev

Re: [Building] Building with JDK11

2022-07-18 Thread Sergey B.
Hi Steve, Can you shed some light on why they need $JAVA_HOME at all if everything is already in place? Regards, - Sergey On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy wrote: > Hi Szymon, > > There seems to be a common misconception that setting JAVA_HOME will set > the ver

Re: Spark sql slowness in Spark 3.0.1

2022-04-14 Thread Sergey B.
The suggestion is to check: 1. The format used for the write 2. The parallelism of the write On Thu, Apr 14, 2022 at 7:13 PM Anil Dasari wrote: > Hello, > > > > We are upgrading Spark from 2.4.7 to 3.0.1. We use Spark SQL (Hive) to > checkpoint data frames (intermediate data). DF write is very slow in 3.0.1 >
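
A rough illustration of those two checks in PySpark (the table name, path and partition count are hypothetical); being explicit about the output format and repartitioning before the write is usually the first thing to try when a checkpointing write slows down after an upgrade:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("checkpoint-write").getOrCreate()
    df = spark.table("intermediate_data")        # hypothetical source table

    # 1. Be explicit about the format used for the write (parquet rather than a Hive text table).
    # 2. Control the parallelism of the write by repartitioning before saving.
    (df.repartition(200)                         # tune to cluster size and data volume
       .write
       .format("parquet")
       .mode("overwrite")
       .save("s3a://bucket/checkpoints/step1"))  # hypothetical path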

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
sing your question. > I do not think it's a bug necessarily; do you end up with one partition in > your execution somewhere? > > On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Of course if I give 64G of ram to each executor they w

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
damages arising from such > loss, damage or destruction. > > > > On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Yes, in fact those are the settings that cause this behaviour. If set to > false, eve

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
deserialization of Row objects. Best regards, Sergey Ivanychev > On 12 Nov 2021, at 05:05, Gourav Sengupta wrote: > > Hi Sergey, > > Please read the excerpts from the book of Dr. Zaharia that I had sent, they > explain these fundamentals clearly. > >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
Yes, in fact those are the settings that cause this behaviour. If set to false, everything goes fine since the implementation in spark sources in this case is pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) Best regards, Sergey Ivanychev > On 11 Nov 2021, at 13
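
Presumably the setting in question is the Arrow-based conversion flag (spark.sql.execution.arrow.pyspark.enabled in Spark 3.x). A minimal sketch of the two code paths on toy data, not the original job:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("topandas-paths").getOrCreate()
    df = spark.range(1_000_000).toDF("id")   # toy stand-in for the real dataset

    # Arrow conversion enabled: toPandas() takes the Arrow collect path.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    pdf_arrow = df.toPandas()

    # Arrow conversion disabled: toPandas() falls back to roughly the
    # collect()-based implementation quoted above.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
    pdf_plain = pd.DataFrame.from_records(df.collect(), columns=df.columns)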

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> did you get to read the excerpts from the book of Dr. Zaharia? I read what you have shared but didn’t manage to get your point. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:38, Gourav Sengupta wrote: > > did you get to read the excerpts from the book of Dr. Zaharia?

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> Just to confirm with Collect() alone, this is all on the driver? I shared the screenshot with the plan in the first email. In the collect() case the data gets fetched to the driver without problems. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:37, Mich Talebzadeh >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
on executors. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 15:17, Mich Talebzadeh wrote: > > > > From your notes ".. IIUC, in the `toPandas` case all the data gets shuffled > to a single executor that fails with OOM, which doesn’t happen in `collect` >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
in execution plans. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 13:12, Mich Talebzadeh wrote: > > > Do you have the output for executors from spark GUI, the one that eventually > ends up with OOM? > > Also what does > > kubectl get pods -n

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
as the driver. Currently, the best solution I found is to write the dataframe to S3, and then read it via pd.read_parquet. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 00:18, Mich Talebzadeh wrote: > > > Thanks for clarification on the koalas case. >
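
A minimal sketch of that workaround (bucket and path are hypothetical): the executors write the result as parquet, and pandas reads it back on the client, so nothing has to be funnelled through a single executor or the driver:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("topandas-workaround").getOrCreate()
    df = spark.range(1_000_000).toDF("id")   # stand-in for the real dataframe

    # Write the result from the executors instead of collecting it on the driver.
    df.write.mode("overwrite").parquet("s3a://my-bucket/tmp/result")   # hypothetical path

    # Load the parquet files straight into pandas on the client
    # (requires pyarrow and s3fs to be installed locally).
    pdf = pd.read_parquet("s3://my-bucket/tmp/result")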

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
at RDD. Also toPandas() converts to Python objects in memory I do not think > that collect does it. > > Regards, > Gourav > > On Wed, Nov 3, 2021 at 2:24 PM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Hi, > > Spark 3.1.2 K8s. >

Using Spark as a fail-over platform for Java app

2021-03-12 Thread Sergey Oboguev
I have an existing plain-Java (non-Spark) application that needs to run in a fault-tolerant way, i.e. if the node crashes then the application is restarted on another node, and if the application crashes because of internal fault, the application is restarted too. Normally I would run it in a

Detecting latecomer events in Spark structured streaming

2021-03-08 Thread Sergey Oboguev
I have a Spark structured streaming based application that performs window(...) construct followed by aggregation. This construct discards latecomer events that arrive past the watermark. I need to be able to detect these late events to handle them out-of-band. The application maintains a raw
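
A minimal PySpark sketch of the window-plus-watermark construct being described, on a toy rate source with assumed column names; rows arriving more than 10 seconds behind the watermark are silently dropped by the aggregation, which is exactly the behaviour the post wants to detect and handle out-of-band:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("late-events").getOrCreate()

    # Toy source standing in for the real event stream; assumed columns: deviceId, eventTime.
    events = (spark.readStream
              .format("rate")
              .option("rowsPerSecond", 10)
              .load()
              .withColumnRenamed("timestamp", "eventTime")
              .withColumn("deviceId", F.col("value") % 10))

    # Latecomers past the 10-second watermark are discarded, with no built-in hook to observe them.
    agg = (events
           .withWatermark("eventTime", "10 seconds")
           .groupBy(F.window("eventTime", "1 minute"), "deviceId")
           .agg(F.count("*").alias("events")))

    query = agg.writeStream.outputMode("append").format("console").start()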

Controlling Spark StateStore retention

2021-02-20 Thread Sergey Oboguev
I am trying to write a Spark Structured Streaming application consisting of GroupState construct followed by aggregation. Events arriving from event sources are bucketized by deviceId and quantized timestamp, composed together into group state key idTime. Logical plan consists of stages (in the
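
For illustration only (the column names and the 60-second quantum are assumptions, and this batch snippet merely shows how such a composite key might be derived, not the streaming GroupState code itself):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("idtime-key").getOrCreate()

    events = spark.createDataFrame(
        [("dev-1", "2021-02-20 10:00:17", 1.0),
         ("dev-1", "2021-02-20 10:00:43", 2.0),
         ("dev-2", "2021-02-20 10:01:05", 3.0)],
        ["deviceId", "ts", "value"],
    ).withColumn("ts", F.to_timestamp("ts"))

    # Quantize the timestamp into 60-second buckets and compose the group-state key idTime.
    quantum = 60
    keyed = events.withColumn(
        "idTime",
        F.concat_ws("#", "deviceId",
                    (F.unix_timestamp("ts") / quantum).cast("long").cast("string")))

    keyed.groupBy("idTime").agg(F.avg("value").alias("avgValue")).show()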

Re: Excessive disk IO with Spark structured streaming

2020-10-07 Thread Sergey Oboguev
them in terms of threading model? Will there be a separate thread per active stream/query or does Spark use a bounded thread pool? Do many streams/queries result in many threads, or a limited number of threads? Thanks, Sergey

Re: Excessive disk IO with Spark structured streaming

2020-10-05 Thread Sergey Oboguev
DFS server implements the calls), but it does matter when a file system like Lustre or NFS is used for checkpoint storage. (Not to mention spawning readlink and chmod does not seem like a bright idea in the first place, although perhaps there might be a reason why Hadoop layer does it this wa

Excessive disk IO with Spark structured streaming

2020-10-04 Thread Sergey Oboguev
I am trying to run a Spark structured streaming program simulating basic scenario of ingesting events and calculating aggregates on a window with watermark, and I am observing an inordinate amount of disk IO Spark performs. The basic structure of the program is like this: sparkSession =

Re: Spark watermarked aggregation query and append output mode

2020-09-23 Thread Sergey Oboguev
Thanks! It appears one should use not *dataset.col("timestamp")* but rather *functions.col("timestamp")*.

Spark watermarked aggregation query and append output mode

2020-09-23 Thread Sergey Oboguev
Hi, I am trying to aggregate Spark time-stamped structured stream to get per-device (source) averages for every second of incoming data. dataset.printSchema(); // see the output below Dataset ds1 = dataset .withWatermark("timestamp", "1 second")
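
A hedged PySpark rendering of the query being described (the Java snippet above is cut off here; the "device" and "value" column names are assumptions, and the toy rate source stands in for the real one). As the reply further up the thread notes, the grouping column should come from functions.col rather than dataset.col in the Java API; F.col plays the same role below:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("per-second-averages").getOrCreate()

    stream = (spark.readStream.format("rate").option("rowsPerSecond", 100).load()
              .withColumn("device", F.col("value") % 4)
              .withColumn("value", F.rand()))

    # Per-device average for every second of event time.
    averages = (stream
                .withWatermark("timestamp", "1 second")
                .groupBy(F.window(F.col("timestamp"), "1 second"), F.col("device"))
                .agg(F.avg("value").alias("avgValue")))

    # Append mode only emits a window once the watermark has passed it.
    query = averages.writeStream.outputMode("append").format("console").start()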

Re: Dynamic metric names

2019-05-06 Thread Sergey Zhemzhitsky
all the corresponding metrics statically. Kind Regards, Sergey On Mon, May 6, 2019, 16:07 Saisai Shao wrote: > I remembered there was a PR about doing similar thing ( > https://github.com/apache/spark/pull/18406). From my understanding, this > seems like a quite specific requiremen

Dynamic metric names

2019-05-04 Thread Sergey Zhemzhitsky
Hello Spark Users! Just wondering whether it is possible to register a metric source without metrics known in advance and add the metrics themselves to this source later on? It seems that currently MetricSystem puts all the metrics from the source's MetricRegistry into a shared MetricRegistry of

Re: FW: Spark2 and Hive metastore

2018-11-12 Thread Sergey B.
In order for Spark to see the Hive metastore you need to build the Spark session accordingly: val spark = SparkSession.builder() .master("local[2]") .appName("myApp") .config("hive.metastore.uris","thrift://localhost:9083") .enableHiveSupport() .getOrCreate() On Mon, Nov 12, 2018 at 11:49
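
For reference, the same builder in PySpark (reusing the local metastore URI from the reply above):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")
             .appName("myApp")
             .config("hive.metastore.uris", "thrift://localhost:9083")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("show databases").show()   # should now list the databases from the Hive metastore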

Re: Accumulator guarantees

2018-05-10 Thread Sergey Zhemzhitsky
/a6fc300e91273230e7134ac6db95ccb4436c6f8f/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L36 [3] https://github.com/apache/spark/blob/3990daaf3b6ca2c5a9f7790030096262efb12cb2/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1204 On Thu, May 10, 2018 at 10:24 PM, Sergey Zhemzhitsky <sz

Accumulator guarantees

2018-05-10 Thread Sergey Zhemzhitsky
Hi there, Although Spark's docs state that there is a guarantee that - accumulators in actions will only be updated once - accumulators in transformations may be updated multiple times ... I'm wondering whether the same is true for transformations in the last stage of the job or there is a
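
A small PySpark sketch of the documented distinction (the question itself is about the Scala API, but the guarantee is stated for both): an accumulator updated inside a transformation is re-applied whenever the stage is recomputed, whereas updates made from actions are counted once:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("accumulator-guarantees").getOrCreate()
    sc = spark.sparkContext

    acc = sc.accumulator(0)

    def bump(x):
        acc.add(1)              # update performed inside a transformation
        return x

    rdd = sc.parallelize(range(100)).map(bump)

    rdd.count()
    print(acc.value)            # 100
    rdd.count()                 # the un-cached RDD is recomputed, so the
    print(acc.value)            # transformation-side updates are applied again: 200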

Re: AccumulatorV2 vs AccumulableParam (V1)

2018-05-04 Thread Sergey Zhemzhitsky
latorsspec-scala-L59 [4] https://github.com/apache/spark/blob/4d5de4d303a773b1c18c350072344bd7efca9fc4/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L51 Kind Regards, Sergey On Thu, May 3, 2018 at 5:20 PM, Wenchen Fan <cloud0...@gmail.com> wrote: > Hi Sergey, > >

AccumulatorV2 vs AccumulableParam (V1)

2018-05-02 Thread Sergey Zhemzhitsky
Hello guys, I've started to migrate my Spark jobs which use Accumulators V1 to AccumulatorV2 and faced the following issues: 1. LegacyAccumulatorWrapper now requires the resulting type of AccumulableParam to implement equals. Otherwise the AccumulableParam, automatically wrapped into

Re: DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
the job completes successfully On Wed, Mar 28, 2018 at 10:31 PM, Jörn Franke <jornfra...@gmail.com> wrote: > Encoding issue of the data? Eg spark uses utf-8 , but source encoding is > different? > >> On 28. Mar 2018, at 20:25, Sergey Zhemzhitsky <szh.s...@gmail.com> wrot

DataFrames :: Corrupted Data

2018-03-28 Thread Sergey Zhemzhitsky
Hello guys, I'm using Spark 2.2.0 and from time to time my job fails printing into the log the following errors scala.MatchError: profiles.total^@^@f2-a733-9304fda722ac^@^@^@^@profiles.10361.10005^@^@^@^@.total^@^@0075^@^@^@^@ scala.MatchError: pr^?files.10056.10040 (of class java.lang.String)

Best way of shipping self-contained pyspark jobs with 3rd-party dependencies

2017-12-08 Thread Sergey Zhemzhitsky
Hi PySparkers, What currently is the best way of shipping self-contained pyspark jobs with 3rd-party dependencies? There are some open JIRA issues [1], [2] as well as corresponding PRs [3], [4] and articles [5], [6], [7] regarding setting up the python environment with conda and virtualenv
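
For what it's worth, the simplest of those options for pure-Python dependencies is to build an archive up front and attach it to the context; the archive and module names below are hypothetical, and the conda/virtualenv approaches from the linked JIRAs are needed once native dependencies are involved:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("self-contained-job").getOrCreate()
    sc = spark.sparkContext

    # Ship a zip of pure-Python dependencies to every executor; modules inside
    # deps.zip become importable from task code (and on the driver).
    sc.addPyFile("deps.zip")            # hypothetical archive built beforehand

    import mymodule                     # hypothetical module packaged inside deps.zip

    result = sc.parallelize(range(10)).map(mymodule.transform).collect()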

Best way of shipping self-contained pyspark jobs with 3rd-party dependencies

2017-12-07 Thread Sergey Zhemzhitsky
Hi PySparkers, What currently is the best way of shipping self-contained pyspark jobs with 3rd-party dependencies? There are some open JIRA issues [1], [2] as well as corresponding PRs [3], [4] and articles [5], [6], regarding setting up the python environment with conda and virtualenv

What is the purpose of having RDD.context and RDD.sparkContext at the same time?

2017-06-27 Thread Sergey Zhemzhitsky
/org/apache/spark/rdd/RDD.scala#L1693 [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L146 Kind Regards, Sergey

Re: Is GraphX really deprecated?

2017-05-16 Thread Sergey Zhemzhitsky
lace, respectively. Jacek On 13 May 2017 3:00 p.m., "Sergey Zhemzhitsky" <szh.s...@gmail.com> wrote: > Hello Spark users, > > I just would like to know whether the GraphX component should be > considered deprecated and no longer actively maintained > and should not

Is GraphX really deprecated?

2017-05-13 Thread Sergey Zhemzhitsky
%22Not%20A%20Bug%22%2C%20%22Not%20A%20Problem%22)%20ORDER%20BY%20created%20DESC So, I'm wondering what the community who uses GraphX, and the committers who develop it, think regarding this Spark component? Kind regards, Sergey - To u

Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev
, Saisai Shao wrote: AFAIK, for the first line a custom filter should work, but for the latter it is not supported. On Fri, Apr 14, 2017 at 6:17 PM, Sergey Grigorev <grigorev-...@yandex.ru> wrote: GET requests like *http://worker:4040/api/v1/

Re: Spark API authentication

2017-04-14 Thread Sergey Grigorev
specifically are you referring to "Spark API endpoint"? The filter can only work with the Spark live and history web UIs. On Fri, Apr 14, 2017 at 5:18 PM, Sergey <grigorev-...@yandex.ru> wrote: Hello all, I've added own spark.ui.f

Spark API authentication

2017-04-14 Thread Sergey
Hello all, I've added own spark.ui.filters to enable basic authentication to access to Spark web UI. It works fine, but I still can do requests to spark API without any authentication. Is there any way to enable authentication for API endpoints? P.S. spark version is 2.1.0, deploy mode is
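
For context, the filter mentioned above is wired in purely through configuration, along these lines (the filter class is a hypothetical servlet filter implementing basic auth; its parameters are passed via the spark.<filter class>.param.<name> convention). As the replies note, this protects the web UI but not the REST API endpoints:

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (SparkConf()
            .set("spark.ui.filters", "com.example.BasicAuthFilter")                # hypothetical filter class
            .set("spark.com.example.BasicAuthFilter.param.username", "admin")      # hypothetical params
            .set("spark.com.example.BasicAuthFilter.param.password", "secret"))

    spark = SparkSession.builder.config(conf=conf).appName("secured-ui").getOrCreate()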

Re: apache-spark doesn't work correktly with russian alphabet

2017-01-18 Thread Sergey B.
Try to get the encoding right. E.g., if you read from `csv` or other sources, specify the encoding, which is most probably `cp1251`: df = sqlContext.read.csv(filePath, encoding="cp1251") On a Linux CLI the encoding can be found with the `chardet` utility. On Wed, Jan 18, 2017 at 3:53 PM, AlexModestov

Re: Adding Hive support to existing SparkSession (or starting PySpark with Hive support)

2016-12-19 Thread Sergey B.
I have asked a similar question here http://stackoverflow.com/questions/40701518/spark-2-0-redefining-sparksession-params-through-getorcreate-and-not-seeing-cha Please see the answer, which basically states that it's impossible to change the session config once it has been initiated. On Mon, Dec 19,

MapWithState would not restore from checkpoint.

2016-06-27 Thread Sergey Zelvenskiy
MapWithState would not restore from checkpoint. MapRDD code requires non empty spark contexts, while the context is empty. ERROR 2016-06-27 11:06:33,236 0 org.apache.spark.streaming.StreamingContext [run-main-0] Error starting the context, marking it as stopped org.apache.spark.SparkException:

RDDs caching in typical machine learning use cases

2016-04-03 Thread Sergey
Hi Spark ML experts! Do you use RDD caching together with MLlib to speed up calculations? I mean typical machine learning use cases: train-test split, train, evaluate, apply model. Sergey.
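
A typical pattern, sketched here with the DataFrame-based API on toy data (the question is about RDDs, where rdd.cache() plays the same role): cache the split datasets once, because iterative training, evaluation and model application all re-read them:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("ml-caching").getOrCreate()

    # Toy feature table standing in for real data.
    df = spark.range(10_000).selectExpr("cast(id % 2 as double) as label",
                                        "rand() as f1", "rand() as f2")
    data = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)

    train, test = data.randomSplit([0.8, 0.2], seed=42)
    train.cache()        # iterative training passes over this data many times
    test.cache()         # evaluation and applying the model reuse it as well

    model = LogisticRegression(maxIter=20).fit(train)
    print(model.evaluate(test).areaUnderROC)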

Random forest implementation details

2016-04-03 Thread Sergey
=256, cacheNodeIds=False, checkpointInterval=10 Sergey

SocketTimeoutException

2016-04-01 Thread Sergey
= pycsv.csvToDataFrame(sql, plaintext_rdd) csv file size is 50MB. I use Spark 1.6.1 on Windows locally. Regards, Sergey. Py4JJavaError Traceback (most recent call last) in () 2 plaintext_rdd = sc.textFile(r'file:///c:\data\sample.csv') 3

strange behavior of pyspark RDD zip

2016-04-01 Thread Sergey
Hi! I'm on Spark 1.6.1 in local mode on Windows. And have an issue with zipping two RDDs of __equal__ size and __equal__ partitions number (I also tried to repartition both RDDs to one partition). I get such exception when I do rdd1.zip(rdd2).count(): File
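
For context, zip's contract in a minimal working sketch (this is not the failing Windows case from the post): both RDDs must have the same number of partitions and the same number of elements in every partition, which is easiest to guarantee when one RDD is a map of the other:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("zip-demo").getOrCreate()
    sc = spark.sparkContext

    rdd1 = sc.parallelize(range(10), 2)
    rdd2 = rdd1.map(lambda x: x * x)    # same partitioning, same per-partition counts

    print(rdd1.zip(rdd2).count())       # 10

    # Two independently built RDDs can have equal total sizes and partition counts
    # yet different per-partition element counts, and then zip() fails at runtime.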

pySpark window functions are not working in the same way as Spark/Scala ones

2015-09-03 Thread Sergey Shcherbakov
"prev"), df("b"), lead(df("b"),1).over(wSpec).alias("next")) Am I doing anything wrong, or is this a pySpark issue indeed? Best Regards, Sergey PS: Here is the full pySpark shell example: from pyspark.sql.window import Window import pyspark.sql.functions as func l = [
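
For comparison, a sketch of the same lag/lead query written against a current PySpark (the original report was on an early 1.x release, where the behaviour differed from Scala):

    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    import pyspark.sql.functions as func

    spark = SparkSession.builder.appName("window-demo").getOrCreate()

    df = spark.createDataFrame([(1, 10), (2, 20), (3, 30), (4, 40)], ["a", "b"])

    # Ordered window over the whole frame; a partitionBy would be added for real data.
    wSpec = Window.orderBy("a")

    df.select(
        func.lag("b", 1).over(wSpec).alias("prev"),
        df["b"],
        func.lead("b", 1).over(wSpec).alias("next"),
    ).show()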

Re: Spark UI tunneling

2015-03-23 Thread Sergey Gerasimov
Akhil, that's what I did. The problem is that probably the web server tried to forward my request to another address accessible locally only. On 23 March 2015, at 11:12, Akhil Das ak...@sigmoidanalytics.com wrote: Did you try ssh -L 4040:127.0.0.1:4040 user@host Thanks Best Regards

Stand-alone Spark on windows

2015-02-26 Thread Sergey Gerasimov
in java.lang.ProcessBuilder.start(). What could be wrong? Should I start some scripts before spark-submit? I have windows 7 and spark 1.2.1 Sergey. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional

Re: Kafka - streaming from multiple topics

2014-07-07 Thread Sergey Malov
: Thursday, July 3, 2014 at 9:41 PM To: user@spark.apache.org Subject: Re: Kafka - streaming from multiple topics Sergey, On Fri, Jul 4, 2014 at 1:06 AM, Sergey Malov <sma...@collective.com>

Re: Kafka - streaming from multiple topics

2014-07-03 Thread Sergey Malov
, calls ConsumerConnector.createMessageStream which returns a Map[String, List[KafkaStream] keyed by topic. It is, however, not exposed. So Kafka does provide capability of creating multiple streams based on topic, but Spark doesn’t use it, which is unfortunate. Sergey From: Tobias Pfeiffer t

Kafka - streaming from multiple topics

2014-07-02 Thread Sergey Malov
” topic or “datapair”, set up different filter function, map function, reduce function, etc. Is it possible ? I’d assume it should be, since ConsumerConnector can map of KafkaStreams keyed on topic, but I can’t find that it would be visible to Spark. Thank you, Sergey Malov

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Sergey Parhomenko
on how to avoid this issue in 0.9.0? -- Best regards, Sergey Parhomenko On 5 March 2014 14:40, Sean Owen so...@cloudera.com wrote: Yes I think that issue is fixed (Patrick you had the last eyes on it IIRC?) If you are using log4j, in general, do not redirect log4j to slf4j. Stuff using log4j

Re: Unable to redirect Spark logs to slf4j

2014-03-05 Thread Sergey Parhomenko
now with the patched module. I'm not sure however if using compile scope for slf4j-log4j12 is a good idea. -- Best regards, Sergey Parhomenko On 5 March 2014 20:11, Patrick Wendell pwend...@gmail.com wrote: Hey All, We have a fix for this but it didn't get merged yet. I'll put it as a blocker