Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this! Xinrong Meng wrote (on Fri, Mar 1, 2024 at 5:24): > Congratulations! > > Thanks, > Xinrong > > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun > wrote: > >> Congratulations! >> >> Bests, >> Dongjoon. >> >> On Wed, Feb 28, 2024 at 11:4

Unsubscribe

2022-11-26 Thread Peter Kovgan
Shay Elbaz wrote on Sun, Aug 14, 2022 at 09:22: > Hi, > > I have a simple ETL application, where the data source schema needs to be > sanitized. Column names might include special characters that need to be > removed. For example, from "some{column}" to "some_column". > Normally I'd just alias

subscribe user@spark.apache.org

2021-01-19 Thread Peter Podlovics
Hello, I would like to subscribe to the above mailing list. I already tried subscribing through the webpage, but I still haven't received the email yet. Thanks, Peter

RE: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread peter
You can try raising the setting spark.streaming.kafka.consumer.cache.maxCapacity. From: Shyam P [mailto:shyamabigd...@gmail.com] Sent: October 21, 2019 20:43 To: kafka-clie...@googlegroups.com; spark users Subject: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey H
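
A minimal sketch of raising that cache size when building the streaming context (the value 128 is only an illustration, not a recommendation):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // bump the per-executor Kafka consumer cache above the default of 64
    val conf = new SparkConf()
      .setAppName("kafka-stream")
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")
    val ssc = new StreamingContext(conf, Seconds(10))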

Re: read binary files (for stream reader) / spark 2.3

2019-09-09 Thread Peter Liu
#option) and other google results and didn't find an option for binary file. Any help would be very much appreciated! (thanks again for Ilya's helpful information below - works fine on sparkContext object) Regards, Peter On Thu, Sep 5, 2019 at 3:09 PM Ilya Matiach wrote: > Hi Peter

Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
https://spark.apache.org/docs/2.3.0/streaming-programming-guide.html https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.package any hint/help would be very much appreciated! thanks! Peter

unsubscribe

2019-08-06 Thread Peter Willis
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, Thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-) Cheers, Peter On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko wrote: > Hi Peter, we're using a part of Crail - it's core library, called disni ( > ht

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hi Peter, we're using a part of Crail - its core library, called disni ( https://github.com/zrlio/disni/). We couldn't reproduce results from that blog post; in any case Crail is a more platform-level approach (it comes with its own file system), while SparkRdma is a pluggable ap

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, thank you for the reply and detailed information! Would this be something comparable with Crail? ( http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html) I was more looking for something simple/quick making the shuffle between the local jvms quicker (like the idea of using local

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin ( https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On-Demand Paging, where the OS invalidates translations which are no longer valid d

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
a quite practical and useful option/feature. Not sure what the status of this ticket's implementation is? Thanks! Peter On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair wrote: > Thanks..great info. Will try and let all know. > > Best > > On Thu, Oct 18, 2018, 3:12 AM onmstester onms

re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
overcommit: https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources any help/hint would be very much appreciated! Regards, Peter FYI: I have a system with 80 vcores and a relatively light spark streaming workload. overcommitting the vcore resource (here 100) see

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
much as the faster one due to different batch processing time. would this be a functional issue? Or, the spark stage input data should not be directly interpreted/compared as/with incoming message data? I seem to miss something here. Thank you! Peter the fast system (batch time in sec; x i

re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
more context info and the real concern in case relevant. Have been struggling with this. Any comments and help would be very much appreciated. Thanks! Regards, Peter = (a) code in use: .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as

unsubscribe

2018-07-05 Thread Peter
unsubscribe

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
error-creating-kafka-stream i also rebuilt the workload with some spark 2.3.1 jars (see (b) below). it doesn't seem to help. Would be great if anyone could kindly share any insights here. Thanks! Peter (a) the exception Exception in thread "stream execution thread for [id = 5adae836

re: streaming - kafka partition transition time from (stage change logger)

2018-06-11 Thread Peter Liu
! Thanks ... Peter [2018-06-08 *15:34:36*,518] TRACE Controller 0 epoch 53 changed state of replica 0 for partition [events,79] from *ReplicaDeletionIneligible *to OfflineReplica (state.change.logger) [2018-06-08 15:34:36,945] TRACE Controller 0 epoch 53 changed state of replica 0 for partition [events

Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
ProcessingTime(10.seconds)) Java Example: df.writeStream().trigger(ProcessingTime.create("10 seconds")) import java.util.concurrent.TimeUnit df.writeStream().trigger(ProcessingTime.create(10, TimeUnit.SECONDS)) Much appreciated! Peter On Fri, May 25, 2018 at 9:11 AM, Jacek Laskows
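
For reference, a minimal structured-streaming sketch of the trigger API, assuming Spark 2.2+ where Trigger.ProcessingTime is available (the console sink is only for illustration):

    import org.apache.spark.sql.streaming.Trigger

    // fire a micro-batch every 10 seconds instead of as fast as possible
    val query = df.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()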

re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
) and (b) mean? I was wondering if this would mean a different streaming approach ("traditional" streaming vs structured streaming)? Basically I need to find a way to set the batch-interval in (b), similar to (a) below. Would be great if someone can please share some insights here.

Re: Advice on multiple streaming job

2018-05-08 Thread Peter Liu
rker machine) that receives a single stream of data Any comment from you guys would be much appreciated! Cheers, Peter On Mon, May 7, 2018 at 5:08 AM, Dhaval Modi wrote: > Hi Gerard, > > Our source is kafka, and we are using standard streaming api (DStreams). > > Our requiremen

re: spark streaming / AnalysisException on collect()

2018-04-30 Thread Peter Liu
k/stackoverflow/cloudera/etc/etc. Any help would be very much appreciated! Thanks! Peter (a) streaming data (df) received on the consumer side (console sink): root |-- ad_id: string (nullable = true) |-- campaign_id: string (nullable = true) |-- timestamp: timestamp (

RE: filters Pushdown

2017-02-02 Thread Peter Shmukler
...@gmail.com] Sent: Thursday, February 2, 2017 12:12 PM To: Peter Shmukler Cc: user@spark.apache.org Subject: Re: filters Pushdown Pushdowns depend on the source connector. Join pushdown with Cassandra only Filter pushdown with mainly all sources with some specific constraints Le 2 févr. 2017 10:42 AM

filters Pushdown

2017-02-02 Thread Peter Sg
Can the community help me figure out some details about Spark: - Does Spark support filter pushdown for types: o Int/long o DateTime o String - Does Spark support pushdown of join operations for partitioned tables (in case the join condition includes the partitioning field)
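
One way to check what actually gets pushed down is to inspect the physical plan. A rough sketch, assuming a Spark 2.x SparkSession named spark (path and column name are made up):

    import spark.implicits._

    // look for "PushedFilters: [...]" in the printed physical plan
    val df = spark.read.parquet("/path/to/table")
      .filter($"eventDate" >= "2017-01-01")
    df.explain(true)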

Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keeping in mind Spark is a parallel computing engine, Spark does not change your data infrastructure/data architecture. These days it's relatively convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) and ditto on the output side. For example, for one of my use-cases, I sto

Re: Parsing XML

2016-10-04 Thread Peter Figliozzi
It's pretty clear that df.col(xpath) is looking for a column named xpath in your df, not executing an xpath over an XML document as you wish. Try constructing a UDF which applies your xpath query, and give that as the second argument to withColumn. On Tue, Oct 4, 2016 at 4:35 PM, Jean Georges Per
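
A rough sketch of that suggestion, assuming an XML string column and the scala-xml module on the classpath (column and element names are invented):

    import org.apache.spark.sql.functions.udf
    import scala.xml.XML

    // extract the text of the first <name> element from each XML document
    val extractName = udf { xml: String =>
      (XML.loadString(xml) \\ "name").headOption.map(_.text).getOrElse("")
    }
    val withName = df.withColumn("name", extractName(df("xml")))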

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Peter Figliozzi
It's a good question. People have been publishing papers on decision trees and various methods of constructing and pruning them for over 30 years. I think it's rather a question for a historian at this point. On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty wrote: > Read this explanation but

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to Doub

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately looking at x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > > This is an issue in most databases. Specifically if a field is NaN.. --> ( >
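
A small illustration of isnan in the DataFrame API (a sketch; the column name is made up):

    import org.apache.spark.sql.functions.{col, isnan}

    // keep only rows whose value is neither null nor NaN
    val clean = df.filter(!isnan(col("value")) && col("value").isNotNull)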

Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution: - Store your data in Cassandra. - Use the expiring data feature (TTL) so data will automatically be removed a month later. - Now in your Spark process, just read f

median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame column *once it is grouped. * It's easy enough now to get the min, max, mean, and other things that are part of spark.sql.functions: df.groupBy("foo", "bar").agg(mean($"column1")) And it's easy enough to get the median of a co
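
One workaround, assuming Spark 2.1+ where percentile_approx is exposed as a SQL function (this yields an approximate median per group):

    import org.apache.spark.sql.functions.expr

    val medians = df.groupBy("foo", "bar")
      .agg(expr("percentile_approx(column1, 0.5)").as("approx_median"))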

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
tructure. Note that repartitioning > is explicit shuffle. > > If you want to have only single file you need to repartition the whole RDD > to single partition. > Depending on the result data size it may be something that you want or do > not want to do ;-) > > Regards, >

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
iński < piotr.smolinski...@gmail.com> wrote: > Hi Peter, > > The blank file _SUCCESS indicates properly finished output operation. > > What is the topology of your application? > I presume, you write to local filesystem and have more than one worker > machine. > In such case Sp

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") results in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it.. tried this with both my real df, and a quick df constructed from literals. Why isn't it writing anythi
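
The replies in this thread point at the output topology: on a cluster the part files are written by the workers, so a local path only shows what the driver node wrote. Two common workarounds, sketched under that assumption:

    // write to a filesystem all workers share (e.g. HDFS) so the part files land together
    df.write.csv("hdfs:///path/to/foo")

    // or, for small results only, collapse to a single partition before writing
    df.coalesce(1).write.csv("/path/to/foo")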

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
t would be interesting is to see how much > time each task/job/stage takes. > > On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi > wrote: > >> It seems to me they must communicate for joins, sorts, grouping, and so >> forth, where the original data partitioning needs to cha

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original data partitioning needs to change. You could repeat your experiment for different code snippets. I'll bet it depends on what you do. On Thu, Sep 22, 2016 at 8:54 AM, gusiri wrote: > Hi, > > When I

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either. On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) wrote: > Thanks, Peter. > It works! > > Why udf is needed? > > > > > On Wed,

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott wrote: > Hi a
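
For reference, a hedged sketch with MLlib's distributed matrix types (indexedRows is a made-up RDD[IndexedRow]; BlockMatrix supports distributed matrix-matrix multiply):

    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // v * t(v): multiply the row-vector matrix by its transpose
    val v = new IndexedRowMatrix(indexedRows).toBlockMatrix()
    val gram = v.multiply(v.transpose)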

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
> classes) are supported by importing spark.implicits._ Support for > serializing other types will be added in future releases. > dataStr.map(row => Vectors.parse(row.getString(1))) > > > Dose anyone can help me, > thanks very much! > > > > > > >

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
der? > > > On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi > wrote: > >> All (three) of them. It's kind of cool-- when I re-run collect() a different >> executor will show up as first to encounter the error. >> >> On Wed, Sep 7, 2016 at 8:20 PM, ayan gu

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
All (three) of them. It's kind of cool-- when I re-run collect() a different executor will show up as first to encounter the error. On Wed, Sep 7, 2016 at 8:20 PM, ayan guha wrote: > Hi > > Is it happening on all executors or one? > > On Thu, Sep 8, 2016 at 10:46 AM, Pete

Fwd: distribute work (files)

2016-09-07 Thread Peter Figliozzi
ks. Here's what I did to generate the error: val data = sc.textFile("file:///home/peter/datashare/*.txt") data.collect() It's working to some extent because if I put a bogus path in, I'll get a different (correct) error (InvalidInputException: Input Pattern file:/home/peter/dd

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
ith a wildcard Thanks, Pete On Tue, Sep 6, 2016 at 11:20 PM, ayan guha wrote: > To access local file, try with file:// URI. > > On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi > wrote: > >> This is a great question. Basically you don't have to worry about the >> detail

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
.org/index.html> is excellent. Pete On Wed, Sep 7, 2016 at 1:39 AM, 颜发才(Yan Facai) wrote: > Hi Peter, > I'm familiar with Pandas / Numpy in python, while spark / scala is > totally new for me. > Pandas provides a detailed document, like how to slice data, parse file, > use ap

No SparkR on Mesos?

2016-09-07 Thread Peter Griessl
mation on this subject in the docs - am I missing something? Thanks for any hints, Peter

Re: distribute work (files)

2016-09-06 Thread Peter Figliozzi
This is a great question. Basically you don't have to worry about the details-- just give a wildcard in your call to textFile. See the Programming Guide section entitled "External Datasets". The Spark framework will distribute your dat

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical features column. Here's one way to do the individual transform: scala> val x = "[1, 2, 3, 4, 5]" x: String = [1, 2, 3, 4, 5] scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt) y: Ar
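
A slightly more compact variant as a UDF, so the whole column can be converted at once (a sketch; the column names are invented and it assumes the Spark 2.x ml Vectors):

    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.functions.udf

    // "[1, 2, 3, 4, 5]" -> dense ML vector
    val toVec = udf { s: String =>
      Vectors.dense(s.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toDouble))
    }
    val withVec = df.withColumn("features", toVec(df("featuresStr")))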

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems an

how to investigate skew and DataFrames and RangePartitioner

2016-06-13 Thread Peter Halliday
does one achieve this now. Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Error writing parquet to S3

2016-06-10 Thread Peter Halliday
Has anyone else seen this before? Previously when I saw this there was an OOM, but that doesn’t seem to be the case here. Of course, I’m not sure how large the file that created this was either. Peter > On Jun 9, 2016, at 9:00 PM, Peter Halliday wrote: > > I’m not 100% sure why I’m getting this. I don’

Error writing parquet to S3

2016-06-09 Thread Peter Halliday
I’m not 100% sure why I’m getting this. I don’t see any errors before this at all. I’m not sure how to diagnose this. Peter Halliday 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16

UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Peter Halliday
I have some code that was producing OOM during shuffle and was RDD-based. So, upon direction by a member of Databricks I started converting to Datasets. However, when we did we got an error that seems to be objecting to something within one of our case classes. Peter Halliday [2016-06-08 19

Re: EMR Spark log4j and metrics

2016-04-15 Thread Peter Halliday
I wonder if anyone can confirm is Spark on YARN the problem here? Or is it how AWS has put it together? I'm wondering if Spark on YARN has problems with configuration files for the workers and driver? Peter Halliday On Thu, Apr 14, 2016 at 1:09 PM, Peter Halliday wrote: > An update

Re: EMR Spark log4j and metrics

2016-04-14 Thread Peter Halliday
't see evidence that the configuration files are read from or used after they are pushed On Wed, Apr 13, 2016 at 11:22 AM, Peter Halliday wrote: > I have an existing cluster that I stand up via Docker images and > CloudFormation Templates on AWS. We are moving to EMR and AWS Data > Pi

EMR Spark log4j and metrics

2016-04-13 Thread Peter Halliday
jar that’s sent via --jars to spark-submit. Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
It doesn't matter - just an example. Imagine yarn cluster with 100GB of ram and i submit simultaneously a lot of jobs in a loop. Thanks, Peter Rudenko On 4/6/16 7:22 PM, Ted Yu wrote: Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn'

[Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
ing for a while. Is it possible to set some sort of timeout for acquiring executors otherwise kill application? Thanks, Peter Rudenko - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands,

Re: spark.ml : eval model outside sparkContext

2016-03-16 Thread Peter Rudenko
Hi Emmanuel, looking for a similar solution. For now found only: https://github.com/truecar/mleap Thanks, Peter Rudenko On 3/16/16 12:47 AM, Emmanuel wrote: Hello, In MLLib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train

FileAlreadyExistsException and Streaming context

2016-03-08 Thread Peter Halliday
stack trace: http://pastebin.com/AqBFXkga Peter Halliday

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
I haven’t tried spark.hadoop.validateOutputSpecs. However, it seems that has to do with the existence of the output directory itself and not the files. Maybe I’m wrong? Peter > On Mar 1, 2016, at 11:53 AM, Sabarish Sasidharan > wrote: > > Have you tried spark.hadoop.validat

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
, but no plans on changing this. I’m surprised not to see this fixed yet. Peter Halliday > On Mar 1, 2016, at 10:01 AM, Ted Yu wrote: > > Do you mind pastebin'ning the stack trace with the error so that we know > which part of the code is under discussion ? > > Th

Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
1.5.1 version of this code doesn’t allow for this to be passed in. Is that correct? Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

SaveMode, parquet and S3

2016-02-29 Thread Peter Halliday
case. Or does the SaveMode only apply to the files as a whole? Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Hadoop credentials missing in some tasks?

2016-02-05 Thread Peter Vandenabeele
queries just work (including the count(*) queries that failed before). This is using a Zeppelin notebook with Spark 1.5.1 with Hadoop 2.4. HTH, Peter

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Peter Zhang
Could you run spark-shell from the $SPARK_HOME dir? You can try to change your command to run from $SPARK_HOME, or point to README.md with its full path. Peter Zhang --  Google Sent with Airmail On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com) wrote: It looks spark is not working fine

Re: SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Thanks,  I will try. Peter --  Google Sent with Airmail On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote: Please make sure you export environment variable HADOOP_CONF_DIR which contains the core-site.xml On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang wrote: Hi all, http

SparkR with Hive integration

2016-01-18 Thread Peter Zhang
file:/user/hive/warehouse/src 16/01/19 12:11:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) How to use HDFS instead of the local file system (file:)? Which parameter should I set? Thanks a lot. Peter Zhang --  Google Sent with Airmail

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Peter Zhang
Hi Eran, Missing import package. import org.apache.spark.sql.types._ will work. please try. Peter Zhang --  Google Sent with Airmail On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com) wrote: Hi, I am using spark-shell with version 1.5.2. scala
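
A minimal illustration of that import in use:

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("id", LongType, nullable = false),
      StructField("name", StringType, nullable = true)
    ))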

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
As I've tried cgroups - it seems the isolation is done by percentage, not by core count. E.g. I've set min share to 256 - I still see all 8 cores, but I could only load 20% of each core. Thanks, Peter Rudenko On 2015-11-10 15:52, Saisai Shao wrote: From my understanding, it

[Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
l 8 cores? Thanks, Peter Rudenko

[Yarn] How to set user in ContainerLaunchContext?

2015-11-02 Thread Peter Rudenko
ataOutputBuffer(); credentials.writeTokenStorageToStream(dob); ByteBuffer.wrap(dob.getData(),0, dob.getLength()).duplicate(); } val cCLC = Records.newRecord(classOf[ContainerLaunchContext]) cCLC.setCommands(List("spark-submit --master yarn ...")) cCLC.setTokens(setupTokens(user)) Thanks, Peter Rudenko

input file from tar.gz

2015-09-29 Thread Peter Rudenko
Hi, i have a huge tar.gz file on dfs. This file contains several files, but i want to use only one of them as input. Is it possible to filter somehow a tar.gz schema, something like this: sc.textFile("hdfs:///data/huge.tar.gz#input.txt") Thanks, Pet

Re: GroupBy Java objects in Java Spark

2015-09-28 Thread Peter Bollerman
Hi, You will want to make sure your key JavaObject implements equals() and hashCode() appropriately. Otherwise you may not get the groupings you expect Peter Bollerman Principal Software Engineer The Allant Group, Inc. 630-718-3830 On Thu, Sep 24, 2015 at 5:27 AM, Sabarish Sasidharan
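
The same point in Scala terms: a case class key gets structural equals/hashCode generated for it, which is exactly what the grouping needs (a sketch; EventKey and the events RDD are invented):

    // records with equal (userId, eventType) end up in the same group
    case class EventKey(userId: Long, eventType: String)

    val grouped = events.map(e => (EventKey(e.userId, e.eventType), e)).groupByKey()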

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Peter Rudenko
Cache(true) boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]]) val model = GradientBoostedTrees.train(instances, boostingStrategy) | Thanks, Peter Rudenko On 2015-08-14 00:33, Sean Owen wrote: Not that I have

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-08-30 Thread Peter Aberline
/apache/spark/pull/8384 For anyone interested if this functionality, please check it out, I’m keen to get feedback on it. Thanks Peter. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
two, 2->three) SI1 0 1 1 2 VectorAssembler.setInputCols(SI1, SI2).setOutputCol(features) -> features 00 11 01 22 HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1) bucket1 bucket2 a,a,b c HT1 3 //Hash collision 3 3 1 Thanks, Peter Rudenko On 2015-08-07 09:55, praveen S wrote: Is StringInde

Re: Delete NA in a dataframe

2015-08-04 Thread Peter Rudenko
thing like this: val rv = allyears2k.filter("COLUMN != `NA`") Thanks, Peter Rudenko On 2015-08-04 15:03, clark djilo kuissu wrote: Hello, I try to manage NA in this dataset. I import my dataset with the com.databricks.spark.csv package When I do this: allyears2k.na.drop() I have no r
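
Spelled out a bit, assuming "COLUMN" stands in for the real column name:

    // drop rows containing nulls, then drop rows where the column holds the literal string "NA"
    val cleaned = allyears2k.na.drop().filter("COLUMN != 'NA'")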

Re: Spark and Speech Recognition

2015-07-30 Thread Peter Wolf
his? > > val data = sc.textFile("/sigmoid/audio/data/", 24).foreachPartition(urls > => speachRecognizer(urls)) > > Let 24 be the total number of cores that you have on all the workers. > > Thanks > Best Regards > > On Wed, Jul 29, 2015 at 6:50 AM, Pe

Spark and Speech Recognition

2015-07-28 Thread Peter Wolf
How does one configure Spark for this sort of application? How does one control how Spark deploys the stages of the process. Can someone point me to an appropriate doc or keywords I should Google. Thanks Peter

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Peter Leventis
Thank you for the answers. I followed numerous recipes, including videos, and encountered many obstacles, such as 7-Zip being unable to unzip the *.gx file and the need to use SBT. My situation is fixed. I use a Windows 7 PC (not Linux). I would be very grateful for an approach that simply works. This is

Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-25 Thread Peter Leventis
I just wanted an easy step by step guide as to exactly what version of what ever to download for a Proof of Concept installation of Apache Spark on Windows 7. I have spent quite some time following a number of different recipes to no avail. I have tried about 10 different permutations to date. I

LDA on a large dataset

2015-07-20 Thread Peter Zvirinsky
spark 1.4.0 in yarn-client mode. I managed to learn a word2vec model on the same dataset with no problems at all. Thanks, Peter

Re: what is metadata in StructField ?

2015-07-15 Thread Peter Rudenko
ibute/attributes.scala Take a look how i'm using metadata to get summary statistics from h2o: https://github.com/h2oai/sparkling-water/pull/17/files Let me know if you'll have questions. Thanks, Peter Rudenko On 2015-07-15 12:48, matd wrote: I see in StructField that we can p

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Peter Rudenko
n resources cleaned when spark application correctly terminates (using sc.stop()). But in my case when it fills all disk space it was stuck and couldn't stop correctly. After I restarted YARN I don't know how to easily trigger cache cleanup except manually on all the nodes. Thanks, Pe

How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Peter Rudenko
understood is of APPLICATION type. Is it possible to restrict a disk space for spark application? Will spark fail if it wouldn't be able to persist on disk (StorageLevel.MEMORY_AND_DISK_SER) or it would recompute from data source? Thanks, Peter Rudenko

Re: spark - redshift !!!

2015-07-07 Thread Peter Zybrick
Hi, I have done a lot of EMR->S3->Redshift using Redshift COPY, haven't done any from Spark yet but I plan on doing it soon and have been doing some research. Take a look at this article - Best Practices for Micro-Batch Loading on Amazon Redshift https://blogs.aws.amazon.com/bigdata/post/Tx2ANLN1P

Re: MLLib- Probabilities with LogisticRegression

2015-06-30 Thread Peter Rudenko
Hi Klaus, you can use the new ml api with dataframes: val model = (new LogisticRegression).setInputCol("features").setProbabilityCol("probability").setOutputCol("prediction").fit(data) Thanks, Peter Rudenko On 2015-06-30 14:00, Klaus Schaefers wrote: Hello, is the
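
A hedged, spelled-out version of that idea with the parameter names the ml estimator actually uses (it assumes the data frame already has a Vector "features" column and a "label" column):

    import org.apache.spark.ml.classification.LogisticRegression

    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setProbabilityCol("probability")
      .setPredictionCol("prediction")

    val model = lr.fit(data)
    val scored = model.transform(data)   // adds probability and prediction columns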

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Peter Rudenko
; spark-shell Thanks, Peter Rudenko On 2015-06-25 20:37, Daniel Haviv wrote: Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemCla

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Peter Rudenko
ossvalidators that uses other technics than grid search (random search crossvalidator, bayesian optimization CV, etc.). Thanks, Peter Rudenko On 2015-06-18 01:58, Xiangrui Meng wrote: On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira wrote: Hi, I am currently experimenting with linear regress

DataFrame and JDBC regression?

2015-06-14 Thread Peter Haumer
me the same error. Has someone else seen this? Am I using the new API correctly? It seems that the SaveModes are broken now. I saw a jira bug that talks about doing wasteful introspections when the table exists, but this seems to be more like a blocking regression. Thanks and best regards, Peter Haumer.

[Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-12 Thread Peter Haumer
ctJavaRDDLike.reduce( JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) Thanks and best regards, Peter Haumer.

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-04 Thread Peter Rudenko
Hi Brandon, they are available, but private to ml package. They are now public in 1.4. For 1.3.1 you can define your transformer in org.apache.spark.ml package - then you could use these traits. Thanks, Peter Rudenko On 2015-06-04 20:28, Brandon Plaster wrote: Is "HasInputCol

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-02 Thread Peter Rudenko
I'm afraid there's no such class for 1.2.1. This API was added to 1.3.0 AFAIK. On 2015-06-02 21:40, Dimp Bhat wrote: Thanks Peter. Can you share the Tokenizer.java class for Spark 1.2.1. Dimple On Tue, Jun 2, 2015 at 10:51 AM, Peter Rudenko mailto:petro.rude...@gmail.com>>

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-02 Thread Peter Rudenko
Frame method. Thanks, Peter On 2015-06-02 20:19, dimple wrote: Hi, I would like to embed my own transformer in the Spark.ml Pipleline but do not see an example of it. Can someone share an example of which classes/interfaces I need to extend/implement in order to do so. Thanks. Dimple -- View

Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko
Hi Cesar, try to do: hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema) It's a bit inefficient, but should shuffle the whole dataframe. Thanks, Peter Rudenko On 2015-06-01 22:49, Cesar Flores wrote: I would like to know what will be the best approach to ran
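
An alternative sketch that shuffles the rows without changing the partition count, at the cost of a sort:

    import org.apache.spark.sql.functions.rand

    // order by a random column to get a random permutation of the rows
    val shuffled = df.orderBy(rand())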

Re: Multilabel classification using logistic regression

2015-05-27 Thread Peter Garbers
ache()) // should use groupBy >> val models = groupedRDDs.map(rdd => new >> LogisticRegressionWithLBFGS().setNumClasses(101).run(rdd)) >> val results = models.map(m => m.predict(Vectors.dense(query.features))) >> >> Thanks >> >> Peter >> >> &

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
in without filtering. I'll rethink maintaining each CSV file as a single DF. Thanks, Peter On 10 May 2015 at 15:51, ayan guha wrote: > How did you end up with thousands of df? Are you using streaming? In that > case you can do foreachRDD and keep merging incoming rdds to single r

Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
e multiple DataFrames within the one Parquet File using PySpark? Or is the only way to achieve this to union the DataFrames into one? Thanks, Peter

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hm, thanks. Do you know what this setting means: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178 ? Thanks, Peter Rudenko On 2015-05-08 17:48, ayan guha wrote: From S3. As the dependency of df will be on s3. And because rdds are

[SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hi, I have a question: val data = sc.textFile("s3:///") val df = data.toDF df.saveAsParquetFile("hdfs://") df.someAction(...) If some workers die during someAction, would recomputation download files from s3 or from the hdfs parquet? Thanks, Peter Rudenko
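
If the goal is to guarantee that later actions recompute from the Parquet copy rather than S3, one hedged sketch (1.3-era API names) is to re-read the saved file and continue from that DataFrame:

    // lineage of fromParquet starts at HDFS, so a lost partition is re-read from Parquet, not S3
    val fromParquet = sqlContext.parquetFile("hdfs:///path/to/parquet")
    fromParquet.count()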
