Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
med. I don't think that supports such a drastic > statement. > > On Wed, Jun 22, 2022 at 12:39 PM Igor Berman > wrote: > >> Hi All >> tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >> everybody will understand consequences of usage of t

repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
Hi All tldr; IMHO repartition(n) should be deprecated or red-flagged, so that everybody will understand the consequences of using this method. Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for recent versions of spark) I think it's very important to

Initial job has not accepted any resources

2017-01-04 Thread Igor Berman
Hi All, need your advice: we see in some very rare cases the following error in the log: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. and in the spark UI there are idle workers and the application is in WAITING state in json

Re: need help to have a Java version of this scala script

2016-12-17 Thread Igor Berman
do you mind showing what you have in java? in general $"bla" is col("bla"), as soon as you import the appropriate function import static org.apache.spark.sql.functions.callUDF; import static org.apache.spark.sql.functions.col; udf should be callUDF e.g. ds.withColumn("localMonth",

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Igor Berman
Takeshi, why are you saying this? how have you checked that it's only used from 2.7.3? We use spark 2.0, which is shipped with a hadoop dependency of 2.7.2, and we use this setting. We've sort of "verified" it's used by enabling logging of the file output committer On 30 September 2016 at 03:12, Takeshi

Re: Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-28 Thread Igor Berman
Michael, can you explain please why bucketBy is supported when using writeAsTable() to parquet but not with parquet()? Is it the only difference between the table api and the dataframe/dataset api, or are there some others? org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now; at

Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
are you using speculation? On 15 September 2016 at 21:37, Chen, Kevin wrote: > Hi, > > Has any one encountered an issue of missing output partition file in S3 ? > My spark job writes output to a S3 location. Occasionally, I noticed one > partition file is missing. As a

Re: Using spark to distribute jobs to standalone servers

2016-08-25 Thread Igor Berman
imho, you'll need to implement a custom rdd with your locality settings (i.e. a custom implementation of discovering where each partition is located) + a setting for spark.locality.wait On 24 August 2016 at 03:48, Mohit Jaggi wrote: > It is a bit hacky but possible. A lot depends

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-20 Thread Igor Berman
in addition check what ip the master is binding to (with netstat) On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > Troubleshooting steps: > > $ telnet localhost 7077 (on master, to confirm port is open) > $ telnet 7077 (on slave, to confirm port is blocked) > > If the port

streaming new data into bigger parquet file

2016-07-06 Thread Igor Berman
Hi, I was reading the following tutorial https://docs.cloud.databricks.com/docs/latest/databricks_guide/07%20Spark%20Streaming/08%20Write%20Output%20To%20S3.html from the databricks_guide about streaming data to s3, and it states that sometimes I need to do compaction of small files (e.g. from spark streaming)

Re: Spark streaming readind avro from kafka

2016-06-01 Thread Igor Berman
An Avro file contains metadata with the schema (the writer schema); in Kafka there is no such thing, so you should put a message that contains some reference to a known schema (putting the whole schema in each message would have big overhead). some people use a schema registry solution On 1 June 2016 at 21:02, justneeraj

different SqlContext with same udf name with different meaning

2016-05-08 Thread Igor Berman
Hi, suppose I have a multitenant environment and I want to give my users additional functions, but for each user/tenant the meaning of the same function depends on the user's specific configuration. is it possible to register the same function several times under different SqlContexts? are several

Re: Apache Flink

2016-04-17 Thread Igor Berman
latency in Flink is not eliminated, but it might be smaller since Flink processes each event 1-by-1 while Spark does microbatching (so you can't achieve latency lower than your microbatch config). probably Spark will have better throughput due to this microbatching On 17 April 2016 at 14:47,

Re: streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Igor Berman
maybe you are experiencing a problem with FileOutputCommitter vs DirectCommitter while working with s3? do you have hdfs so you can try it to verify? committing on s3 will copy all partitions 1-by-1 from _temporary to your final destination bucket, so this stage might become a bottleneck (so reducing

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-05 Thread Igor Berman
it's not safe to use the direct committer with append mode, you may lose your data.. On 4 March 2016 at 22:59, Jelez Raditchkov wrote: > Working on a streaming job with DirectParquetOutputCommitter to S3 > I need to use PartitionBy and hence SaveMode.Append > > Apparently when

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Igor Berman
your field name is *enum1_values* but you have data { "foo1": "test123", *"enum1"*: "BLUE" } i.e. since you defined enum and not union(null, enum) it tries to find value for enum1_values and doesn't find one... On 3 March 2016 at 11:30, Chris Miller wrote: > I've been

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Igor Berman
spark.driver.extraClassPath spark.executor.extraClassPath 2016-03-02 18:01 GMT+02:00 Matthias Niehoff : > Hi, > > we want to add jars to the Master and Worker class path mainly for logging > reason (we have a redis appender to send logs to redis -> logstash -> >
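A minimal sketch of passing both properties at submit time; the appender jar path is an assumption, and the jar must already exist at that path on every node, since extraClassPath does not ship files:

  spark-submit \
    --conf spark.driver.extraClassPath=/opt/libs/redis-log4j-appender.jar \
    --conf spark.executor.extraClassPath=/opt/libs/redis-log4j-appender.jar \
    --class com.example.Main app.jar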

Re: .cache() changes contents of RDD

2016-02-27 Thread Igor Berman
are you using the avro format by any chance? there are some formats that need to be "deep"-copied before caching or aggregating. try something like val input = sc.newAPIHadoopRDD(...) val rdd = input.map(deepCopyTransformation).map(...) rdd.cache() rdd.saveAsTextFile(...) where deepCopyTransformation is
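A minimal sketch of the pattern, assuming Avro generic records read via newAPIHadoopFile; the input path is hypothetical:

  import org.apache.avro.generic.{GenericData, GenericRecord}
  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyInputFormat
  import org.apache.hadoop.io.NullWritable

  val input = sc.newAPIHadoopFile(
    "hdfs:///data/events.avro",                  // hypothetical input path
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])

  // the reader reuses one underlying record object, so deep-copy each record
  // before caching; otherwise cached entries all alias the same buffer
  val rdd = input.map { case (k, _) =>
    GenericData.get().deepCopy(k.datum().getSchema, k.datum())
  }
  rdd.cache()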

Re: DirectFileOutputCommiter

2016-02-27 Thread Igor Berman
. it won't be "silent" failure. On 26 February 2016 at 11:50, Reynold Xin <r...@databricks.com> wrote: > It could lose data in speculation mode, or if any job fails. > > On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman <igor.ber...@gmail.com> > wrote: > >> Take

Re: Bug in DiskBlockManager subDirs logic?

2016-02-26 Thread Igor Berman
I've experienced this kind of output when an executor was killed (e.g. by the OOM killer) or was lost for some reason, i.e. try to look at the machine to check whether the executor was restarted... On 26 February 2016 at 08:37, Takeshi Yamamuro wrote: > Hi, > > Could you make simple codes to

Re: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Igor Berman
Imho most production clusters are standalone. there was some presentation from spark summit with some stats inside (can't find it right now), and standalone was in 1st place. it was from Matei https://databricks.com/resources/slides On 26 February 2016 at 13:40, Petr Novak

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Takeshi, do you know the reason why they wanted to remove this committer in SPARK-10063? the jira has no info inside. as far as I understand, the direct committer can't be used when either of two things is true: 1. speculation mode 2. append mode (i.e. not creating a new version of the data but appending to existing

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Alexander, the implementation you've attached supports both modes, configured by the property "mapred.output.direct." + fs.getClass().getSimpleName() as soon as you see a _temporary dir, the mode is probably off, i.e. the default impl is working and you are experiencing some other problem. On 26 February 2016 at

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
the performance gain is in the commit stage, when data is moved from the _temporary directory to the destination directory. since s3 is really key-value, the move operation is like a copy operation On 26 February 2016 at 08:24, Takeshi Yamamuro wrote: > Hi, > > Great work! > What is the

Re: reasonable number of executors

2016-02-23 Thread Igor Berman
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications there is a section that is connected to your question On 23 February 2016 at 16:49, Alex Dzhagriev wrote: > Hello all, > > Can someone please advise me on the pros and cons on

Re: SPARK-9559

2016-02-18 Thread Igor Berman
what are you trying to solve? killing the worker jvm is like killing the yarn node manager... why would you do this? usually the worker jvm is an "agent" on each worker machine which opens executors per application, so it doesn't work hard or have a big memory footprint. yes it can fail, but it's rather a corner

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Igor Berman
String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/*json" not sure, but can you try to remove s3-us-west-1.amazonaws.com from the path? On 11 February 2016 at 23:15, Andy Davidson wrote: > I am

Re: Shuffle memory woes

2016-02-08 Thread Igor Berman
>>> Hi, Corey: >>> "The dataset is 100gb at most, the spills can be up to 10T-100T". Are >>> your input files lzo format, and do you use sc.text()? If memory is not >>> enough, spark will spill 3-4x of input data to disk.

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Igor Berman
show has a truncate argument; pass false so it won't truncate your results On 7 February 2016 at 11:01, SLiZn Liu wrote: > Plus, I'm using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried > HiveContext, but the result is exactly the same. > > On Sun, Feb 7, 2016 at
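For illustration (df standing in for the DataFrame in question):

  df.show(false)      // default row count, full cell contents
  df.show(50, false)  // 50 rows, no truncation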

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
f magnitude when it needs to spill multiple times (it ends up spilling > several hundred for me when it can't fit stuff into memory). > > > > On Sat, Feb 6, 2016 at 6:40 AM, Igor Berman <igor.ber...@gmail.com> wrote: > >> Hi, >> usually you can solve this by 2 steps

Re: Shuffle memory woes

2016-02-06 Thread Igor Berman
Hi, usually you can solve this in 2 steps: make the rdd have more partitions, and play with the shuffle memory fraction. in spark 1.6 the cache vs shuffle memory fractions are adjusted automatically On 5 February 2016 at 23:07, Corey Nolet wrote: > I just recently had a discovery that my

Re: multi-threaded Spark jobs

2016-01-25 Thread Igor Berman
IMHO, you are making a mistake. spark manages tasks and cores internally. when you open new threads inside an executor you are "over-provisioning" the executor (e.g. tasks on other cores will be preempted) On 26 January 2016 at 07:59, Elango Cheran wrote: > Hi everyone, >

Re: Converting CSV files to Avro

2016-01-17 Thread Igor Berman
https://github.com/databricks/spark-avro ? On 17 January 2016 at 13:46, Gideon wrote: > Hi everyone, > > I'm writing a Scala program which uses Spark CSV > to read CSV files from a > directory. After reading the CSVs as

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
ster demo/Demo-1.0-SNAPSHOT-all.jar > > Cheers, > Michael > > > On 07.01.2016 22:41, Igor Berman wrote: > > share how you submit your job > what cluster(yarn, standalone) > > On 7 January 2016 at 23:24, Michael Pisula <michael.pis...@tngtech.com> > wrote: >

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
ffect. > I was able to increase the number of cores the job was using on one worker, > but it would not use any other worker (and it would not start if the number > of cores the job wanted was higher than the number available on one worker). > > > On 07.01.2016 22:51, Igor Berman wrote: &

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
share how you submit your job and what cluster (yarn, standalone) On 7 January 2016 at 23:24, Michael Pisula wrote: > Hi there, > > I ran a simple Batch Application on a Spark Cluster on EC2. Despite having > 3 > Worker Nodes, I could not get the application processed on

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
another option would be to try rdd.toLocalIterator(), not sure if it will help though. I had the same problem and ended up moving all parts to local disk (with the Hadoop FileSystem api) and then processing them locally On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try

Re: partitioning json data in spark

2015-12-27 Thread Igor Berman
have you tried to specify the format of your output? might be parquet is the default format. df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path"); On 27 December 2015 at 15:18, Narek Galstyan wrote: > Hey all! > I am willing to partition *json *data by a column
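A sketch of the column-partitioned variant (the column name is an assumption; partitionBy on DataFrameWriter exists from Spark 1.4):

  import org.apache.spark.sql.SaveMode

  df.write
    .format("json")
    .mode(SaveMode.Overwrite)
    .partitionBy("year")   // hypothetical partition column
    .save("/tmp/path")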

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Igor Berman
sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 is supported) or use Dmitriy's solution. select() defines your projection, while here you've specified an entire query On 25 December 2015 at 15:42, Dmitriy Vasilets wrote: > hello > you can try to use

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
David, can you verify that the mysql connector classes are indeed in your single jar? open it with the zip tool available on your platform. another thing that might be a problem: if there is some dependency in the MANIFEST (not sure though whether this is the case for the mysql connector) then it might be broken after

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
the Maven manifest goes, I'm really not sure. I will research > it though. Now I'm wondering if my mergeStrategy is to blame? I'm going > to try there next. > > Thank you for the help! > > On Tue, Dec 22, 2015 at 1:18 AM, Igor Berman <igor.ber...@gmail.com> > wrote: > >

Re: fishing for help!

2015-12-21 Thread Igor Berman
look for differences: package versions, cpu/network/memory diffs etc etc On 21 December 2015 at 14:53, Eran Witkon wrote: > Hi, > I know it is a wide question but can you think of reasons why a pyspark > job which runs from server 1 using user 1 will run faster than

Re: Spark with log4j

2015-12-21 Thread Igor Berman
I think the log4j.properties under the conf dir is the one relevant for the workers' jvms, not the one that you pack within your jar On 21 December 2015 at 14:07, Kalpesh Jadhav wrote: > Hi Ted, > > > > Thanks for your response, but it doesn't solve my

Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Igor Berman
check version compatibility. I think the avro lib should be 1.7.4. check that no other lib brings a transitive dependency on another avro version On 16 December 2015 at 09:44, Jinyuan Zhou wrote: > Hi, I tried to load avro files in hdfs but keep getting NPE. > I am using

Re: Preventing an RDD from shuffling

2015-12-16 Thread Igor Berman
imho, you should implement your own rdd with mongo sharding awareness; then this rdd will have the mongo-aware partitioner, and incoming data will be partitioned by this partitioner in the join. not sure if it's a simple task... but you have to get a partitioner into your mongo rdd. On 16 December 2015

Re: Need Help Diagnosing/operating/tuning

2015-11-23 Thread Igor Berman
you should check why the executor is killed. as soon as it's killed you can get all kinds of strange exceptions... either give your executors more memory (4G is rather small for spark) or try to decrease your input, or maybe split it into more partitions in the input format. 23G in lzo might expand to x?

Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
you've asked for total cores to be 2, + 1 for the driver (since you are running in cluster mode, it's running on one of the slaves). change total cores to be 3*2 and change the submit mode to client - you'll have full utilization (btw it's not advisable to use all the cores of a slave... since there are OS processes and
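A sketch of the suggested submission; the master url, class and jar are placeholders:

  spark-submit \
    --master spark://<master-host>:7077 \
    --deploy-mode client \
    --total-executor-cores 6 \
    --class com.example.Main app.jar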

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
try to assemble log4j.xml or log4j.properties in your jar... probably you'll get what you want. however, pay attention that when you move to a multinode cluster there will be differences On 20 November 2015 at 05:10, Afshartous, Nick wrote: > > < log4j.properties file

Re: status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
further reading of the MasterPage code gave me what I want: http://<master>:8080/json returns a json view of all the info presented on the main page On 9 November 2015 at 22:41, Igor Berman <igor.ber...@gmail.com> wrote: > Hi, > How do I get status of workers(slaves) from driver? > why I need it - I wa
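For example (host is a placeholder):

  curl http://<master-host>:8080/json

which returns the workers/applications info from the master page as json.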

status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
Hi, how do I get the status of the workers (slaves) from the driver? why I need it: I want to autoscale new workers and want to poll the status of the cluster (e.g. the number of alive slaves connected) so that I'll submit a job only after the expected number of slaves has joined the cluster. I've found the MasterPage class that produces

Re: Whether Spark is appropriate for our use case.

2015-11-07 Thread Igor Berman
1. if you have a join by some specific field (e.g. user id or account-id or whatever) you may try to partition the parquet file by this field; then the join will be more efficient. 2. you need to see in the spark metrics what the performance of the particular join is, how many partitions there are, what the shuffle

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using avro with compression (snappy). as soon as you have enough partitions, the saving won't be a problem imho. in general hdfs is pretty fast, s3 less so. the issue with storing data is that you will lose your partitioner (even though the rdd has it) at loading time. There is a PR that

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
gmail.com> wrote: > How to convert a parquet file that is saved in hdfs to an RDD after > reading the file from hdfs? > > On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman <igor.ber...@gmail.com> > wrote: > >> Hi, >> we are using avro with compression(snappy). As s

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data uncompressed in memory. For

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Igor Berman
many use it. how do you add the aws sdk to the classpath? check in the environment ui what is in the cp. you should make sure that the version in your cp is compatible with the one spark was compiled with. I think 1.7.4 is compatible (at least we use it). make sure that you don't get other versions from other

Re: spark straggle task

2015-10-20 Thread Igor Berman
"We know that the JobScheduler has the function to assign the straggler task to another node" - only if you enable and configure spark.speculation On 20 October 2015 at 15:20, Triones,Deng(vip.com) wrote: > Hi All > > We run an application with version 1.4.1 standalone
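A sketch of turning it on via SparkConf; the two tuning values shown are the documented defaults:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.multiplier", "1.5") // how many times slower than the median marks a straggler
    .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before speculating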

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Igor Berman
Do your iterations really submit a job? I don't see any action there On Oct 17, 2015 00:03, "Jia Zhan" wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep the memory size > for performance tuning. The machine has 8 CPUs and 16G main memory, the

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Igor Berman
thanks Ted :) On 18 October 2015 at 19:07, Ted Yu wrote: > Interesting reading material. > > bq. transformations that loose partitioner > > lose partitioner > > bq. Spark looses the partitioner > > loses the partitioner > > bq. Tunning number of partitions > > Should be

Re: Question about GraphX connected-components

2015-10-10 Thread Igor Berman
let's start from some basics: maybe you need to split your data into more partitions? spilling depends on your configuration when you create the graph (look for the storage level param) and your global configuration. in addition, your assumption of 64GB/100M is probably wrong, since spark divides memory

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
fields, removing duplicates, and saving those results with > exactly the same schema. > > Thank you for the answer, at least I know that there is no way to make it > works. > > > 2015-10-09 20:19 GMT+02:00 Igor Berman <igor.ber...@gmail.com>: > >> u should create co

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
you should create a copy of your avro data before working with it, i.e. just after loadFromHDFS map it into a new instance that is a deep copy of the object. it's connected to the way the spark/avro reader reads avro files (it reuses some buffer or something) On 9 October 2015 at 19:05, alberskib

Re: RDD of ImmutableList

2015-10-05 Thread Igor Berman
kryo doesn't support guava's collections by default. I remember encountering a project on github that fixes this (not sure though). I've ended up avoiding guava collections as far as spark rdds are concerned. On 5 October 2015 at 21:04, Jakub Dubovsky wrote: > Hi

Re: Getting spark application driver ID programmatically

2015-10-02 Thread Igor Berman
if driver id is application id then yes you can do it with String appId = ctx.sc().applicationId(); //when ctx is java context On 1 October 2015 at 20:44, Snehal Nagmote wrote: > Hi , > > I have use case where we need to automate start/stop of spark streaming >

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Igor Berman
Try to broadcast the header On Sep 22, 2015 08:07, "Balaji Vijayan" wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and > Scala IDE) but not the 3rd

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
as a starting point, attach your stacktrace... ps: look for duplicates in your classpath, maybe you include another jar with the same class On 8 September 2015 at 06:38, Nicholas R. Peterson wrote: > I'm trying to run a Spark 1.4.1 job on my CDH5.4 cluster, through Yarn. >

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
oader.loadClass(Launcher.java:331) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > at > com.esotericsoftware.kryo.util.DefaultClassResolver.readName(DefaultCla

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
/Document > com/i2028/Document/Document$1.class > com/i2028/Document/Document.class > > What else can I do? Is there any way to get more information about the > classes available to the particular classloader kryo is using? > > On Tue, Sep 8, 2015 at 6:34 AM Igor

Re: Java vs. Scala for Spark

2015-09-08 Thread Igor Berman
we are using java7.. it's much more verbose than the java8 or scala examples. in addition, there are sometimes libraries that have no java api, so you need to write them yourself (e.g. graphx). on the other hand, scala is not a trivial language like java, so it depends on your team On 8 September 2015 at

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
com> wrote: > Yeah... none of the jars listed on the classpath contain this class. The > only jar that does is the fat jar that I'm submitting with spark-submit, > which as mentioned isn't showing up on the classpath anywhere. > > -- Nick > > On Tue, Sep 8, 2015 at 8:

Re: Problem to persist Hibernate entity from Spark job

2015-09-06 Thread Igor Berman
how do you create your session? do you reuse it across threads? how do you create/close the session manager? look for the problem in session creation, probably something deadlocked; as far as I remember hib.session should be created per thread On 6 September 2015 at 07:11, Zoran Jeremic

Re: Tuning - tasks per core

2015-09-03 Thread Igor Berman
suppose you have 1 job that does some transformation, suppose you have X cores in your cluster and you are willing to give all of them to your job, and suppose you have no shuffles (to keep it simple). set the number of partitions of your input data to be 3X or 2X; thus you'll get 2-3 tasks per core On
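A minimal sketch under those assumptions (the core count and input path are hypothetical):

  val totalCores = 16                       // X: cores given to the job
  val rdd = sc.textFile("hdfs:///input")    // hypothetical input
    .repartition(3 * totalCores)            // ~3 tasks per core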

Re: Managing httpcomponent dependency in Spark/Solr

2015-09-03 Thread Igor Berman
not sure if it will help, but have you checked https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html On 31 August 2015 at 19:33, Oliver Schrenk wrote: > Hi, > > We are running a distibuted indexing service for Solr (4.7) on a Spark > (1.2)

Re: bulk upload to Elasticsearch and shuffle behavior

2015-09-01 Thread Igor Berman
Hi Eric, I see that you solved your problem. Imho, when you do repartition you split your work into 2 stages, so your hbase lookup happens in the first stage, and the upload to ES happens after the shuffle in the next stage; so without repartition it's hard to tell where the ES upload is and where the Hbase lookup

Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Igor Berman
are there other processes on sk3? or more generally, are you sharing resources with somebody else, virtualization etc? does your transformation consume other services? (e.g. reading from s3, so it can happen that s3 latency plays a role...) can it be that the task for some key will take longer than

Re: spark-submit issue

2015-08-31 Thread Igor Berman
might be you need to drain the stdout/stderr of the subprocess... otherwise the subprocess can deadlock http://stackoverflow.com/questions/3054531/correct-usage-of-processbuilder On 27 August 2015 at 16:11, pranay wrote: > I have a java program that does this - (using Spark
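A sketch of draining from the parent process (Scala; the submitted class and jar are placeholders):

  import java.io.{BufferedReader, InputStreamReader}

  val pb = new ProcessBuilder("spark-submit", "--class", "com.example.Main", "app.jar")
  pb.redirectErrorStream(true)   // merge stderr into stdout so one reader drains both pipes
  val proc = pb.start()
  val reader = new BufferedReader(new InputStreamReader(proc.getInputStream))
  Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(println)
  val exitCode = proc.waitFor()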

Re: spark-submit issue

2015-08-31 Thread Igor Berman
.. i went > inside /proc//fd and just tailed "2" (stderr) and the process > immediately exits . > > > *From:* Igor Berman <igor.ber...@gmail.com> > *Sent:* Monday, August 31, 2015 12:41 PM > *To:* Pranay Tonpay > *Cc:* user > *Subject:* Re: spark-submit issue

Re: Parallel execution of RDDs

2015-08-31 Thread Igor Berman
what is the size of the pool you are submitting spark jobs from (the futures you've mentioned)? is it 8? I think you have a fixed thread pool of 8, so there can't be more than 8 parallel jobs running... so try to increase it. what is the number of partitions of each of your rdds? how many cores has your worker

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
any differences in number of cores, memory settings for executors? On 19 August 2015 at 09:49, Rick Moritz rah...@gmail.com wrote: Dear list, I am observing a very strange difference in behaviour between a Spark 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
, there are different defaults for the two means of job submission that come into play in a non-transparent fashion (i.e. not visible in SparkConf). On Wed, Aug 19, 2015 at 1:36 PM, Igor Berman igor.ber...@gmail.com wrote: any differences in number of cores, memory settings for executors? On 19 August

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
you don't need to register, search in youtube for this video... On 19 August 2015 at 18:34, Gourav Sengupta gourav.sengu...@gmail.com wrote: Excellent resource: http://www.oreilly.com/pub/e/3330 And more amazing is the fact that the presenter actually responds to your questions. Regards,

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Igor Berman
by default standalone creates 1 executor on every worker machine per application. the number of overall cores is configured with --total-executor-cores, so in general if you specify --total-executor-cores=1 then there will be only 1 core on some executor and you'll get what you want. on the other

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
check which ip/port the master listens on: netstat -a -t --numeric-ports On 7 August 2015 at 20:48, Jeff Jones jjo...@adaptivebiotech.com wrote: Thanks. Added this to both the client and the master but still not getting any more information. I confirmed the flag with ps. jjones53222 2.7

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
an enum's hashcode is jvm-instance specific (i.e. different jvms will give you different values), so you can use the ordinal in the hashCode computation, or use the hashCode of the enum's ordinal as part of the hashCode computation On 6 August 2015 at 11:41, Warfish sebastian.ka...@gmail.com wrote: Hi everyone, I was
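A sketch of the idea, using java.util.concurrent.TimeUnit to stand in for any Java enum used inside an RDD key:

  import java.util.concurrent.TimeUnit

  class Record(val id: Long, val unit: TimeUnit) extends Serializable {
    // ordinal is stable across jvms; the enum's default identity hashCode is not,
    // which is what breaks shuffle key grouping across executors
    override def hashCode: Int = 31 * id.hashCode + unit.ordinal
    override def equals(o: Any): Boolean = o match {
      case r: Record => r.id == id && r.unit == unit
      case _         => false
    }
  }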

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
using coalesce might be dangerous, since 1 worker process will need to handle the whole file, and if the file is huge you'll get OOM. however, it depends on the implementation; I'm not sure how it will be done. nevertheless, it's worth trying the coalesce method (please post your results) another option would be

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
seems that coalesce does work, see the following thread https://www.mail-archive.com/user%40spark.apache.org/msg00928.html On 5 August 2015 at 09:47, Igor Berman igor.ber...@gmail.com wrote: using coalesce might be dangerous, since 1 worker process will need to handle whole file and if the file

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Igor Berman
org.apache.spark.io.LZ4CompressionCodec -- Original Message -- *From:* Igor Berman; igor.ber...@gmail.com; *Sent:* Monday, August 3, 2015, 7:56 PM *To:* Sea 261810...@qq.com; *Cc:* Barak Gitsis bar...@similarweb.com; Ted Yu yuzhih...@gmail.com; user@spark.apache.org; rxinr

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Igor Berman
in general, what is your configuration? use --conf spark.logConf=true. we have 1.4.1 in a production standalone cluster and haven't experienced what you are describing. can you verify in the web-ui that spark indeed got your 50g per executor limit? I mean in the configuration page.. might be you are using

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there a config for the http solr client? I remember the standard httpclient has a limit per route/host. On Aug 2, 2015 8:17 PM, Sujit Pal sujitatgt...@gmail.com wrote: No one has any ideas? Is there some more information I should provide? I am

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
of connections against Solr to a higher number. This time round, I would like to do this through Spark because it makes the pipeline less complex. -sujit On Sun, Aug 2, 2015 at 10:52 AM, Igor Berman igor.ber...@gmail.com wrote: What kind of cluster? How many cores on each worker? Is there config

Re: Too many open files

2015-07-29 Thread Igor Berman
you probably should increase the file handle limit for the user that all the processes are running as (spark master, workers) e.g. http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/ On 29 July 2015 at 18:39, saif.a.ell...@wellsfargo.com wrote: Hello, I've seen a couple

Re: Spark Number of Partitions Recommendations

2015-07-29 Thread Igor Berman
imho, you need to take into account the size of your data too. if your cluster is relatively small, you may cause memory pressure on your executors by trying to repartition to some number of partitions tied to #cores; better to take some max between the initial number of partitions (assuming your data is

Re: Is spark suitable for real time query

2015-07-22 Thread Igor Berman
you can use the spark rest job server (or any other solution that provides a long-running spark context) so that you won't pay this bootstrap time on each query. in addition: if you have some rdd that you want your queries to be executed on, you can cache this rdd in memory (depends on your cluster memory

Re: Create RDD from output of unix command

2015-07-14 Thread Igor Berman
have you thought about spark streaming? there is a thread that could help https://www.mail-archive.com/user%40spark.apache.org/msg30105.html On 14 July 2015 at 18:20, Hafsa Asif hafsa.a...@matchinguu.com wrote: Your question is very interesting. What I suggest is that you copy your output in
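If it's a one-off command, a minimal sketch that runs it on the driver and parallelizes the output (the command is just an example):

  import scala.sys.process._

  val output = Seq("bash", "-c", "df -h").!!              // run command, capture stdout
  val rdd = sc.parallelize(output.trim.split("\n").toSeq) // one element per line

for running a command against data already in an RDD, rdd.pipe() is the built-in route.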

Re: Dependency Injection with Spark Java

2015-06-26 Thread Igor Berman
asked myself the same question today... it actually depends on what you are trying to do. if you want injection into the workers' code, I think it will be a bit hard... if only into the code that the driver executes, i.e. in main, it's straightforward imho: just create your classes from the injector (e.g. spring's application

Re: Spark standalone cluster - resource management

2015-06-23 Thread Igor Berman
probably there are already jobs running there. in addition, memory is also a resource, so if you are running 1 application that took all your memory and then you try to run another application that asks for memory the cluster doesn't have, then the second app won't run. so why are u

Re: How to set KryoRegistrator class in spark-shell

2015-06-11 Thread Igor Berman
Another option would be to close sc and open a new context with your custom configuration On Jun 11, 2015 01:17, bhomass bhom...@gmail.com wrote: you need to register using spark-default.xml as explained here
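A sketch of that inside spark-shell; the registrator class name is hypothetical:

  sc.stop()
  val conf = new org.apache.spark.SparkConf()
    .setAppName("shell-with-kryo")
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.example.MyRegistrator") // your KryoRegistrator implementation
  val sc2 = new org.apache.spark.SparkContext(conf)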

Re: SparkContext Threading

2015-06-05 Thread Igor Berman
+1 to the question about serialization. SparkContext is still in the driver process (even if it has several threads from which you submit jobs). as for the problem: check your classpath, scala version, spark version etc. such errors usually happen when there is some conflict in the classpath. Maybe you

Re: SparkContext Threading

2015-06-05 Thread Igor Berman
Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos? in yarn-cluster the driver program is executed on one of the nodes in the cluster, so it might be that the driver code needs to be serialized to be sent to some node On 5 June 2015 at 22:55, Lee McFadden splee...@gmail.com wrote:

Re: Managing spark processes via supervisord

2015-06-03 Thread Igor Berman
assuming you are talking about a standalone cluster: imho, with workers you won't get any problems and it's straightforward, since they are usually foreground processes. with the master it's a bit more complicated: ./sbin/start-master.sh goes to the background, which is not good for supervisor, but anyway I think

Re: union and reduceByKey wrong shuffle?

2015-06-02 Thread Igor Berman
much work is to produce a small standalone reproduction? Can you create an Avro file with some mock data, maybe 10 or so records, then reproduce this locally? On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman igor.ber...@gmail.com wrote: switching to use simple pojos instead of using avro for spark
