Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-02-29 Thread Peter Toth
Congratulations and thanks Jungtaek for driving this! Xinrong Meng wrote (on Fri, Mar 1, 2024, 5:24): > Congratulations! > > Thanks, > Xinrong > > On Thu, Feb 29, 2024 at 11:16 AM Dongjoon Hyun > wrote: > >> Congratulations! >> >> Bests, >> Dongjoon. >> >> On Wed, Feb 28, 2024 at

Unsubscribe

2022-11-26 Thread Peter Kovgan
Shay Elbaz wrote on Sun, Aug 14, 2022 at 09:22: > Hi, > > I have a simple ETL application, where the data source schema needs to be > sanitized. Column names might include special characters that need to be > removed. For example, from "some{column}" to "some_column". > Normally I'd just alias

subscribe user@spark.apache.org

2021-01-19 Thread Peter Podlovics
Hello, I would like to subscribe to the above mailing list. I already tried subscribing through the webpage, but I still haven't received the email yet. Thanks, Peter

Re: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey

2019-10-21 Thread peter
You can try raising the setting spark.streaming.kafka.consumer.cache.maxCapacity From: Shyam P [mailto:shyamabigd...@gmail.com] Sent: October 21, 2019 20:43 To: kafka-clie...@googlegroups.com; spark users Subject: Issue : KafkaConsumer cache hitting max capacity of 64, removing consumer for CacheKey
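
For reference, this is a plain Spark configuration knob; a minimal sketch of raising it (assuming Spark 2.x with the spark-streaming-kafka-0-10 integration; the app name and the value 128 are only illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("kafka-streaming-job")  // hypothetical app name
    // default is 64; size it to roughly the number of topic-partitions each executor reads
    .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")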

Re: read binary files (for stream reader) / spark 2.3

2019-09-09 Thread Peter Liu
#option) and other google results and didn't find an option for binary file. Any help would be very much appreciated! (thanks again for Ilya's helpful information below - works fine on sparkContext object) Regards, Peter On Thu, Sep 5, 2019 at 3:09 PM Ilya Matiach wrote: > Hi Peter, > >

Re: read image or binary files / spark 2.3

2019-09-05 Thread Peter Liu
://spark.apache.org/docs/2.3.0/streaming-programming-guide.html https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.package any hint/help would be very much appreciated! thanks! Peter

unsubscribe

2019-08-06 Thread Peter Willis
unsubscribe - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, Thanks for the additional information - this is really helpful (I definitely got more than I was looking for :-) Cheers, Peter On Fri, Oct 19, 2018 at 12:53 PM Peter Rudenko wrote: > Hi Peter, we're using a part of Crail - its core library, called disni ( > https://gith

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hi Peter, we're using a part of Crail - its core library, called disni ( https://github.com/zrlio/disni/). We couldn't reproduce results from that blog post; in any case Crail is a more platform-like approach (it comes with its own file system), while SparkRdma is a pluggable approach - it's just

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Liu
Hi Peter, thank you for the reply and detailed information! Would this something comparable with Crail? ( http://crail.incubator.apache.org/blog/2017/11/rdmashuffle.html) I was more looking for something simple/quick making the shuffle between the local jvms quicker (like the idea of using local

Re: Spark In Memory Shuffle / 5403

2018-10-19 Thread Peter Rudenko
Hey Peter, in the SparkRDMA shuffle plugin ( https://github.com/Mellanox/SparkRDMA) we're using mmap of the shuffle file to do Remote Direct Memory Access. If the shuffle data is bigger than RAM, Mellanox NICs support On Demand Paging, where the OS invalidates translations which are no longer valid due

Re: Spark In Memory Shuffle / 5403

2018-10-18 Thread Peter Liu
be a quite practical and useful option/feature. not sure what is the status of this ticket implementation? Thanks! Peter On Thu, Oct 18, 2018 at 6:51 AM ☼ R Nair wrote: > Thanks..great info. Will try and let all know. > > Best > > On Thu, Oct 18, 2018, 3:12 AM onmstester onms

re: yarn resource overcommit: cpu / vcores

2018-10-11 Thread Peter Liu
overcommit: https://www.slideshare.net/HadoopSummit/investing-the-effects-of-overcommitting-yarn-resources any help/hint would be very much appreciated! Regards, Peter FYI: I have a system with 80 vcores and a relatively light spark streaming workload. overcommitting the vcore resource (here 100) see

Re: [External Sender] re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
as the faster one due to different batch processing time. would this be a functional issue? Or, the spark stage input data should not be directly interpreted/compared as/with incoming message data? I seem to miss something here. Thank you! Peter the fast system (batch time in sec; x

re: streaming, batch / spark 2.2.1

2018-08-02 Thread Peter Liu
ntext info and the real concern in case relevant. Have been struggling with this. Any comments and help would be very much appreciated. Thanks! Regards, Peter = (a) code in use: .selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Times

unsubscribe

2018-07-05 Thread Peter
unsubscribe

Re: spark 2.3.1 with kafka spark-streaming-kafka-0-10 (java.lang.AbstractMethodError)

2018-06-28 Thread Peter Liu
-creating-kafka-stream i also rebuilt the workload with some spark 2.3.1 jars (see (b) below). it doesn't seem to help. Would be great if anyone could kindly share any insights here. Thanks! Peter (a) the exception Exception in thread "stream execution thread for [id = 5adae836-268a-4ebf

re: streaming - kafka partition transition time from (stage change logger)

2018-06-11 Thread Peter Liu
! Thanks ... Peter [2018-06-08 *15:34:36*,518] TRACE Controller 0 epoch 53 changed state of replica 0 for partition [events,79] from *ReplicaDeletionIneligible *to OfflineReplica (state.change.logger) [2018-06-08 15:34:36,945] TRACE Controller 0 epoch 53 changed state of replica 0 for partition [events

Re: help with streaming batch interval question needed

2018-05-25 Thread Peter Liu
le: df.writeStream().trigger(ProcessingTime.create("10 seconds")) import java.util.concurrent.TimeUnit df.writeStream().trigger(ProcessingTime.create(10, TimeUnit.SECONDS)) Muchly appreciated! Peter On Fri, May 25, 2018 at 9:11 AM, Jacek Laskowski <ja...@japila.pl> wrote: >
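
A minimal sketch of the same idea with the Trigger helper (assuming Spark 2.2+ structured streaming and a streaming DataFrame named df; the console sink is just for illustration):

  import java.util.concurrent.TimeUnit
  import org.apache.spark.sql.streaming.Trigger

  val query = df.writeStream
    .format("console")
    .trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))  // fire a micro-batch every 10 seconds
    .start()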

re: help with streaming batch interval question needed

2018-05-24 Thread Peter Liu
in (a) and (b) mean? I was wondering if this would mean a different streaming approach ("traditional" streaming vs structured streaming)? Basically I need to find a way to set the batch-interval in (b), similar to (a) below. Would be great if someone can please share some insights here. Than

Re: Advice on multiple streaming job

2018-05-08 Thread Peter Liu
eceives a single stream of data Any comment from you guys would be much appreciated! Cheers, Peter On Mon, May 7, 2018 at 5:08 AM, Dhaval Modi <dhavalmod...@gmail.com> wrote: > Hi Gerard, > > Our source is kafka, and we are using standard streaming api (DStreams). > >

re: spark streaming / AnalysisException on collect()

2018-04-30 Thread Peter Liu
ckoverflow/cloudera/etc/etc. Any help would be very much appreciated! Thanks! Peter (a) streaming data (df) received on the consumer side (console sink): root |-- ad_id: string (nullable = true) |-- campaign_id: string (nullable = true) |-- timestamp: timestamp (nulla

RE: filters Pushdown

2017-02-02 Thread Peter Shmukler
...@gmail.com] Sent: Thursday, February 2, 2017 12:12 PM To: Peter Shmukler <pe...@varonis.com> Cc: user@spark.apache.org Subject: Re: filters Pushdown Pushdowns depend on the source connector. Join pushdown with Cassandra only Filter pushdown with mainly all sources with some specific constraints

filters Pushdown

2017-02-02 Thread Peter Sg
Can the community help me to figure out some details about Spark: - Does Spark support filter Pushdown for types: o Int/long o DateTime o String - Does Spark support Pushdown of join operations for partitioned tables (in case the join condition includes partitioning

Re: Design patterns for Spark implementation

2016-12-08 Thread Peter Figliozzi
Keeping in mind Spark is a parallel computing engine, Spark does not change your data infrastructure/data architecture. These days it's relatively convenient to read data from a variety of sources (S3, HDFS, Cassandra, ...) and ditto on the output side. For example, for one of my use-cases, I

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Peter Figliozzi
It's a good question. People have been publishing papers on decision trees and various methods of constructing and pruning them for over 30 years. I think it's rather a question for a historian at this point. On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty wrote: >

Re: Treating NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to

Re: Treating NaN fields in Spark

2016-09-28 Thread Peter Figliozzi
In Scala, x.isNaN returns true for Double.NaN, but false for any character. I guess the `isnan` function you are using works by ultimately looking at x.isNaN. On Wed, Sep 28, 2016 at 5:56 AM, Mich Talebzadeh wrote: > > This is an issue in most databases. Specifically

Re: read multiple files

2016-09-27 Thread Peter Figliozzi
If you're up for a fancy but excellent solution: - Store your data in Cassandra. - Use the expiring data feature (TTL) so data will automatically be removed a month later. - Now in your Spark process, just read

median of groups

2016-09-26 Thread Peter Figliozzi
I'm trying to figure out a nice way to get the median of a DataFrame column *once it is grouped. * It's easy enough now to get the min, max, mean, and other things that are part of spark.sql.functions: df.groupBy("foo", "bar").agg(mean($"column1")) And it's easy enough to get the median of a
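
One hedged option, assuming Spark 2.1+ where percentile_approx is a built-in SQL function: an approximate grouped median via expr(), since spark.sql.functions has no median():

  import org.apache.spark.sql.functions.expr

  val medians = df.groupBy("foo", "bar")
    .agg(expr("percentile_approx(column1, 0.5)").as("median_column1"))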

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
repartitioning > is explicit shuffle. > > If you want to have only single file you need to repartition the whole RDD > to single partition. > Depending on the result data size it may be something that you want or do > not want to do ;-) > > Regards, > Piotr > > > &g
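
A minimal sketch of that suggestion with the DataFrame writer (assuming Spark 2.0+, where csv is built in; on 1.x the com.databricks.spark.csv format used elsewhere in this thread applies). The output path is still a directory; _SUCCESS is just the job's completion marker:

  df.coalesce(1)            // one partition => one part file (fine only for small results)
    .write
    .option("header", "true")
    .csv("/path/to/foo")    // look for part-*.csv inside this directory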

Re: Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-26 Thread Peter Figliozzi
< piotr.smolinski...@gmail.com> wrote: > Hi Peter, > > The blank file _SUCCESS indicates properly finished output operation. > > What is the topology of your application? > I presume, you write to local filesystem and have more than one worker > machine. > In such case Spa

Writing Dataframe to CSV yields blank file called "_SUCCESS"

2016-09-25 Thread Peter Figliozzi
Both df.write.csv("/path/to/foo") and df.write.format("com.databricks.spark.csv").save("/path/to/foo") result in a *blank* file called "_SUCCESS" under /path/to/foo. My df has stuff in it. I tried this with both my real df, and a quick df constructed from literals. Why isn't it writing

Re: Is executor computing time affected by network latency?

2016-09-23 Thread Peter Figliozzi
y. What would be interesting is to see how much > time each task/job/stage takes. > > On Thu, Sep 22, 2016 at 5:11 PM Peter Figliozzi <pete.figlio...@gmail.com> > wrote: > >> It seems to me they must communicate for joins, sorts, grouping, and so >> forth, where the original

Re: Is executor computing time affected by network latency?

2016-09-22 Thread Peter Figliozzi
It seems to me they must communicate for joins, sorts, grouping, and so forth, where the original data partitioning needs to change. You could repeat your experiment for different code snippets. I'll bet it depends on what you do. On Thu, Sep 22, 2016 at 8:54 AM, gusiri

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-21 Thread Peter Figliozzi
I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either. On Wed, Sep 21, 2016 at 3:32 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > Thanks, Peter. > It works! > > Why udf is needed? > > > > > On Wed

Re: Similar Items

2016-09-20 Thread Peter Figliozzi
Related question: is there anything that does scalable matrix multiplication on Spark? For example, we have that long list of vectors and want to construct the similarity matrix: v * T(v). In R it would be: v %*% t(v) Thanks, Pete On Mon, Sep 19, 2016 at 3:49 PM, Kevin Mellott

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-20 Thread Peter Figliozzi
ng, etc) and Product types (case > classes) are supported by importing spark.implicits._ Support for > serializing other types will be added in future releases. > dataStr.map(row => Vectors.parse(row.getString(1))) > > > Dose anyone can help me, > thanks very much! > > > &
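
A minimal sketch of the udf route referenced in this thread (assuming the mllib vector type, org.apache.spark.mllib.linalg, whose Vectors.parse understands the "[1.0,2.0,...]" string form; the column names are illustrative):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.sql.functions.udf

  val parseVec = udf((s: String) => Vectors.parse(s))
  val withVec  = dataStr.withColumn("features", parseVec(dataStr("featuresStr")))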

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
.textFiles to that folder? > > > On Thu, Sep 8, 2016 at 11:26 AM, Peter Figliozzi <pete.figlio...@gmail.com > > wrote: > >> All (three) of them. It's kind of cool-- when I re-run collect() a different >> executor will show up as first to encounter the erro

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
All (three) of them. It's kind of cool-- when I re-run collect() a different executor will show up as first to encounter the error. On Wed, Sep 7, 2016 at 8:20 PM, ayan guha <guha.a...@gmail.com> wrote: > Hi > > Is it happening on all executors or one? > > On Thu, Sep 8, 20

Fwd: distribute work (files)

2016-09-07 Thread Peter Figliozzi
. Here's what I did to generate the error: val data = sc.textFile("file:///home/peter/datashare/*.txt") data.collect() It's working to some extent because if I put a bogus path in, I'll get a different (correct) error (InvalidInputException: Input Pattern file:/home/peter/ddatashare/*.txt match

Re: distribute work (files)

2016-09-07 Thread Peter Figliozzi
with a wildcard Thanks, Pete On Tue, Sep 6, 2016 at 11:20 PM, ayan guha <guha.a...@gmail.com> wrote: > To access local file, try with file:// URI. > > On Wed, Sep 7, 2016 at 8:52 AM, Peter Figliozzi <pete.figlio...@gmail.com> > wrote: > >> This is a great question. Basica

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-07 Thread Peter Figliozzi
ndex.html> is excellent. Pete On Wed, Sep 7, 2016 at 1:39 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote: > Hi Peter, > I'm familiar with Pandas / Numpy in python, while spark / scala is > totally new for me. > Pandas provides a detailed document, like how to slice data, parse file, &

No SparkR on Mesos?

2016-09-07 Thread Peter Griessl
tion on this subject in the docs - am I missing something? Thanks for any hints, Peter

Re: distribute work (files)

2016-09-06 Thread Peter Figliozzi
This is a great question. Basically you don't have to worry about the details-- just give a wildcard in your call to textFile. See the Programming Guide section entitled "External Datasets". The Spark framework will distribute your

Re: Dataframe, Java: How to convert String to Vector ?

2016-09-06 Thread Peter Figliozzi
Hi Yan, I think you'll have to map the features column to a new numerical features column. Here's one way to do the individual transform: scala> val x = "[1, 2, 3, 4, 5]" x: String = [1, 2, 3, 4, 5] scala> val y:Array[Int] = x slice(1, x.length - 1) replace(",", "") split(" ") map(_.toInt) y:

Re: What do I loose if I run spark without using HDFS or Zookeeper?

2016-08-25 Thread Peter Figliozzi
Spark is a parallel computing framework. There are many ways to give it data to chomp down on. If you don't know why you would need HDFS, then you don't need it. Same goes for Zookeeper. Spark works fine without either. Much of what we read online comes from people with specialized problems

how to investigate skew and DataFrames and RangePartitioner

2016-06-13 Thread Peter Halliday
does one achieve this now. Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Error writing parquet to S3

2016-06-10 Thread Peter Halliday
Has anyone else seen this before? Before when I saw this there was an OOM but doesn’t seem so. Of course, I’m not sure how large the file that created this was either. Peter > On Jun 9, 2016, at 9:00 PM, Peter Halliday <pjh...@cornell.edu> wrote: > > I’m not 100% sure

Error writing parquet to S3

2016-06-09 Thread Peter Halliday
I’m not 100% sure why I’m getting this. I don’t see any errors before this at all. I’m not sure how to diagnose this. Peter Halliday 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16

UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Peter Halliday
I have some code that was producing OOM during shuffle and was RDD. So, upon direction by a member of Databricks I started covering to Datasets. However, when we did we are getting an error that seems to be not liking something within one of our case classes. Peter Halliday [2016-06-08 19

Re: EMR Spark log4j and metrics

2016-04-15 Thread Peter Halliday
I wonder if anyone can confirm whether Spark on YARN is the problem here, or whether it is how AWS has put it together. I'm wondering if Spark on YARN has problems with configuration files for the workers and driver? Peter Halliday On Thu, Apr 14, 2016 at 1:09 PM, Peter Halliday <pjh...@cornell.edu>

Re: EMR Spark log4j and metrics

2016-04-14 Thread Peter Halliday
see evidence that the configuration files are read from or used after they are pushed On Wed, Apr 13, 2016 at 11:22 AM, Peter Halliday <pjh...@cornell.edu> wrote: > I have an existing cluster that I stand up via Docker images and > CloudFormation Templates on AWS. We are moving to

EMR Spark log4j and metrics

2016-04-13 Thread Peter Halliday
to a jar that's sent via --jars to spark-submit. Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: [Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
It doesn't matter - just an example. Imagine yarn cluster with 100GB of ram and i submit simultaneously a lot of jobs in a loop. Thanks, Peter Rudenko On 4/6/16 7:22 PM, Ted Yu wrote: Which hadoop release are you using ? bq. yarn cluster with 2GB RAM I assume 2GB is per node. Isn't this too

[Yarn] Spark AMs dead lock

2016-04-06 Thread Peter Rudenko
or a while. Is it possible to set some sort of timeout for acquiring executors otherwise kill application? Thanks, Peter Rudenko - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-ma

Re: spark.ml : eval model outside sparkContext

2016-03-16 Thread Peter Rudenko
Hi Emmanuel, looking for a similar solution. For now found only: https://github.com/truecar/mleap Thanks, Peter Rudenko On 3/16/16 12:47 AM, Emmanuel wrote: Hello, In MLLib with Spark 1.4, I was able to eval a model by loading it and using `predict` on a vector of features. I would train

FileAlreadyExistsException and Streaming context

2016-03-08 Thread Peter Halliday
the stack trace: http://pastebin.com/AqBFXkga <http://pastebin.com/AqBFXkga> Peter Halliday

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
I haven’t tried spark.hadoop.validateOutputSpecs. However, it seems that has to do with the existence of the output directory itself and not the files. Maybe I’m wrong? Peter > On Mar 1, 2016, at 11:53 AM, Sabarish Sasidharan > <sabarish.sasidha...@manthan.com> wrote: > &

Re: Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
, but no plans on changing this. I’m surprised not to see this fixed yet. Peter Halliday > On Mar 1, 2016, at 10:01 AM, Ted Yu <yuzhih...@gmail.com> wrote: > > Do you mind pastebin'ning the stack trace with the error so that we know > which part of the code is under discus

Get rid of FileAlreadyExistsError

2016-03-01 Thread Peter Halliday
the 1.5.1 version of this code doesn’t allow for this to be passed in. Is that correct? Peter Halliday - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Hadoop credentials missing in some tasks?

2016-02-05 Thread Peter Vandenabeele
executors and after that, all queries just work (including the count(*) queries that failed before). This is using a Zeppelin notebook with Spark 1.5.1 with Hadoop 2.4. HTH, Peter

Re: spark 1.6.0 on ec2 doesn't work

2016-01-18 Thread Peter Zhang
Could you run spark-shell in the $SPARK_HOME directory? You can try changing your command to run at $SPARK_HOME, or point to README.md with the full path. Peter Zhang --  Google Sent with Airmail On January 19, 2016 at 11:26:14, Oleg Ruchovets (oruchov...@gmail.com) wrote: It looks spark is not working fine
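
A minimal sketch of the second suggestion (the install directory is hypothetical): give spark-shell an absolute path so the lookup no longer depends on the working directory:

  val readme = sc.textFile("file:///opt/spark/README.md")  // hypothetical $SPARK_HOME location
  readme.count()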

SparkR with Hive integration

2016-01-18 Thread Peter Zhang
/warehouse/src 16/01/19 12:11:51 ERROR DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:file:/user/hive/warehouse/src is not a directory or unable to create one) How to use HDFS instead of the local file system (file)? Which parameter should I set? Thanks a lot. Peter Zhang --  Google Sent with Airmail

Re: SparkR with Hive integration

2016-01-18 Thread Peter Zhang
Thanks,  I will try. Peter --  Google Sent with Airmail On January 19, 2016 at 12:44:46, Jeff Zhang (zjf...@gmail.com) wrote: Please make sure you export environment variable HADOOP_CONF_DIR which contains the core-site.xml On Mon, Jan 18, 2016 at 8:23 PM, Peter Zhang <zhangju...@gmail.

Re: error: not found: value StructType on 1.5.2

2015-12-20 Thread Peter Zhang
Hi Eran, Missing import package. import org.apache.spark.sql.types._ will work. please try. Peter Zhang --  Google Sent with Airmail On December 20, 2015 at 21:43:42, Eran Witkon (eranwit...@gmail.com) wrote: Hi, I am using spark-shell with version 1.5.2. scala
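
A minimal sketch of the fix (the schema itself is just an example):

  import org.apache.spark.sql.types._

  val schema = StructType(Seq(
    StructField("id", IntegerType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))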

Re: [Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
As I've tried cgroups - it seems the isolation is done by percentage, not by core count. E.g. I've set min share to 256 - I still see all 8 cores, but I could only load 20% of each core. Thanks, Peter Rudenko On 2015-11-10 15:52, Saisai Shao wrote: From my understanding, it depends

[Yarn] Executor cores isolation

2015-11-10 Thread Peter Rudenko
l 8 cores? Thanks, Peter Rudenko

[Yarn] How to set user in ContainerLaunchContext?

2015-11-02 Thread Peter Rudenko
tputBuffer(); credentials.writeTokenStorageToStream(dob); ByteBuffer.wrap(dob.getData(),0, dob.getLength()).duplicate(); } val cCLC = Records.newRecord(classOf[ContainerLaunchContext]) cCLC.setCommands(List("spark-submit --master yarn ...")) cCLC.setTokens(setupTokens(user)) Thanks, Peter Rudenko

input file from tar.gz

2015-09-29 Thread Peter Rudenko
Hi, i have a huge tar.gz file on dfs. This file contains several files, but i want to use only one of them as input. Is it possible to filter somehow a tar.gz schema, something like this: sc.textFile("hdfs:///data/huge.tar.gz#input.txt") Thanks, Pet
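
textFile cannot address a single member of a tar archive, so one assumption-laden workaround is to open the archive yourself from binaryFiles (this sketch assumes Apache commons-compress is on the classpath and omits stream closing and error handling):

  import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
  import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
  import scala.io.Source

  val lines = sc.binaryFiles("hdfs:///data/huge.tar.gz").flatMap { case (_, pds) =>
    val tar = new TarArchiveInputStream(new GzipCompressorInputStream(pds.open()))
    Iterator.continually(tar.getNextTarEntry)
      .takeWhile(_ != null)
      .filter(_.getName == "input.txt")                        // keep only the wanted entry
      .flatMap(_ => Source.fromInputStream(tar).getLines().toList)
  }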

Re: GroupBy Java objects in Java Spark

2015-09-28 Thread Peter Bollerman
Hi, You will want to make sure your key JavaObject implements equals() and hashCode() appropriately. Otherwise you may not get the groupings you expect Peter Bollerman Principal Software Engineer The Allant Group, Inc. 630-718-3830 On Thu, Sep 24, 2015 at 5:27 AM, Sabarish Sasidharan

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-09-03 Thread Peter Rudenko
Cache(true) boostingStrategy.treeStrategy.setCategoricalFeaturesInfo( mapAsJavaMap(categoricalFeatures).asInstanceOf[java.util.Map[java.lang.Integer, java.lang.Integer]]) val model = GradientBoostedTrees.train(instances, boostingStrategy) | Thanks, Peter Rudenko On 2015-08-14 00:33, Sean Owen wrote: Not that I have

Re: Spark Python with SequenceFile containing numpy deserialized data in str form

2015-08-30 Thread Peter Aberline
/apache/spark/pull/8384 For anyone interested if this functionality, please check it out, I’m keen to get feedback on it. Thanks Peter. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h

Re: StringIndexer + VectorAssembler equivalent to HashingTF?

2015-08-07 Thread Peter Rudenko
(SI1, SI2).setOutputCol(features) - features 00 11 01 22 HashingTF.setNumFeatures(2).setInputCol(COL1).setOutputCol(HT1) bucket1 bucket2 a,a,b c HT1 3 //Hash collision 3 3 1 Thanks, Peter Rudenko On 2015-08-07 09:55, praveen S wrote: Is StringIndexer + VectorAssembler equivalent

Re: Delete NA in a dataframe

2015-08-04 Thread Peter Rudenko
this: val rv = allyears2k.filter(COLUMN != `NA`) Thanks, Peter Rudenko On 2015-08-04 15:03, clark djilo kuissu wrote: Hello, I try to manage NA in this dataset. I import my dataset with the com.databricks.spark.csv package When I do this: allyears2k.na.drop() I have no result. Can you help me
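
A minimal sketch of the same filter with explicit column syntax (assuming Spark 2.0+, where the not-equal operator is =!=; on 1.x it was !==; the column name is illustrative). Alternatively, map "NA" to real nulls at read time and drop them:

  import org.apache.spark.sql.functions.col

  val cleaned = allyears2k.filter(col("someColumn") =!= "NA")
  // or: spark.read.option("nullValue", "NA").csv("...") followed by .na.drop()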

Re: Spark and Speech Recognition

2015-07-30 Thread Peter Wolf
(/sigmoid/audio/data/, 24).foreachPartition(urls = speachRecognizer(urls)) Let 24 be the total number of cores that you have on all the workers. Thanks Best Regards On Wed, Jul 29, 2015 at 6:50 AM, Peter Wolf opus...@gmail.com wrote: Hello, I am writing a Spark application to use speech

Re: Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-26 Thread Peter Leventis
Thank you for the answers. I followed numerous recipes, including videos, and encountered many obstacles, such as 7-Zip being unable to unzip the *.gz file and the need to use SBT. My situation is fixed: I use a Windows 7 PC (not Linux). I would be very grateful for an approach that simply works. This is

Download Apache Spark on Windows 7 for a Proof of Concept installation

2015-07-25 Thread Peter Leventis
I just wanted an easy step by step guide as to exactly what version of what ever to download for a Proof of Concept installation of Apache Spark on Windows 7. I have spent quite some time following a number of different recipes to no avail. I have tried about 10 different permutations to date. I

LDA on a large dataset

2015-07-20 Thread Peter Zvirinsky
to learn a word2vec model on the same dataset with no problems at all. Thanks, Peter

Re: what is metadata in StructField ?

2015-07-15 Thread Peter Rudenko
/attributes.scala Take a look how i'm using metadata to get summary statistics from h2o: https://github.com/h2oai/sparkling-water/pull/17/files Let me know if you'll have questions. Thanks, Peter Rudenko On 2015-07-15 12:48, matd wrote: I see in StructField that we can provide metadata

Re: How to restrict disk space for spark caches on yarn?

2015-07-13 Thread Peter Rudenko
application correctly terminates (using sc.stop()). But in my case, when it filled all disk space it got stuck and couldn't stop correctly. After I restarted yarn I don't know how to easily trigger cache cleanup except manually on all the nodes. Thanks, Peter Rudenko On 2015-07-10 20:07, Andrew

How to restrict disk space for spark caches on yarn?

2015-07-10 Thread Peter Rudenko
understood is of APPLICATION type. Is it possible to restrict a disk space for spark application? Will spark fail if it wouldn't be able to persist on disk (StorageLevel.MEMORY_AND_DISK_SER) or it would recompute from data source? Thanks, Peter Rudenko

Re: spark - redshift !!!

2015-07-07 Thread Peter Zybrick
Hi, I have done a lot of EMR-S3-Redshift using Redshift COPY, haven't done any from Spark yet but I plan on doing it soon and have been doing some research. Take a look at this article - Best Practices for Micro-Batch Loading on Amazon Redshift

Re: MLLib- Probabilities with LogisticRegression

2015-06-30 Thread Peter Rudenko
Hi Klaus, you can use the new ml api with dataframes: val model = (new LogisticRegression).setInputCol(features).setProbabilityCol(probability).setOutputCol(prediction).fit(data) Thanks, Peter Rudenko On 2015-06-30 14:00, Klaus Schaefers wrote: Hello, is there a way to get the during the predict
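
A fuller hedged sketch of the same idea (the column and DataFrame names are illustrative; LogisticRegression adds a probability column next to the prediction):

  import org.apache.spark.ml.classification.LogisticRegression

  val lr = new LogisticRegression()
    .setFeaturesCol("features")
    .setLabelCol("label")
    .setProbabilityCol("probability")

  val model = lr.fit(trainingData)   // trainingData: DataFrame with label/features columns
  model.transform(testData).select("probability", "prediction").show()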

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Peter Rudenko
Thanks, Peter Rudenko On 2015-06-25 20:37, Daniel Haviv wrote: Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584

Re: Parallel parameter tuning: distributed execution of MLlib algorithms

2015-06-17 Thread Peter Rudenko
techniques than grid search (random search crossvalidator, bayesian optimization CV, etc.). Thanks, Peter Rudenko On 2015-06-18 01:58, Xiangrui Meng wrote: On Fri, May 22, 2015 at 6:15 AM, Hugo Ferreira h...@inesctec.pt wrote: Hi, I am currently experimenting with linear regression (SGD) (Spark

DataFrame and JDBC regression?

2015-06-14 Thread Peter Haumer
this? Am I using the new API correctly? It seems that the SaveModes are broken now. I saw a jira bug that talks about doing wasteful introspections when the table exists, but this seems to be more like a blocking regression. Thanks and best regards, Peter Haumer.

[Spark 1.4.0] java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation

2015-06-12 Thread Peter Haumer
( JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) Thanks and best regards, Peter Haumer.

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-04 Thread Peter Rudenko
Hi Brandon, they are available, but private to ml package. They are now public in 1.4. For 1.3.1 you can define your transformer in org.apache.spark.ml package - then you could use these traits. Thanks, Peter Rudenko On 2015-06-04 20:28, Brandon Plaster wrote: Is HasInputCol and HasOutputCol
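
A minimal sketch of a custom transformer against the 1.4/1.5-era spark.ml API discussed here (column names hard-coded for brevity instead of HasInputCol/HasOutputCol params; in Spark 2.x transform takes a Dataset[_] rather than a DataFrame):

  import org.apache.spark.ml.Transformer
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.ml.util.Identifiable
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.upper
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Upper-cases a hard-coded "text" column into "text_upper".
  class UpperCaser(override val uid: String) extends Transformer {
    def this() = this(Identifiable.randomUID("upperCaser"))

    override def transform(dataset: DataFrame): DataFrame =
      dataset.withColumn("text_upper", upper(dataset("text")))

    override def transformSchema(schema: StructType): StructType =
      StructType(schema.fields :+ StructField("text_upper", StringType, nullable = true))

    override def copy(extra: ParamMap): UpperCaser = new UpperCaser(uid)
  }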

Re: Embedding your own transformer in Spark.ml Pipleline

2015-06-02 Thread Peter Rudenko
method. Thanks, Peter On 2015-06-02 20:19, dimple wrote: Hi, I would like to embed my own transformer in the Spark.ml Pipleline but do not see an example of it. Can someone share an example of which classes/interfaces I need to extend/implement in order to do so. Thanks. Dimple -- View

Re: Dataframe random permutation?

2015-06-01 Thread Peter Rudenko
Hi Cesar, try to do: hc.createDataFrame(df.rdd.coalesce(NUM_PARTITIONS, shuffle = true), df.schema) It's a bit inefficient, but should shuffle the whole dataframe. Thanks, Peter Rudenko On 2015-06-01 22:49, Cesar Flores wrote: I would like to know what will be the best approach to randomly
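
A shorter hedged alternative (assuming Spark 1.4+, where rand() is available): sorting by a random key also produces a full shuffle of the rows:

  import org.apache.spark.sql.functions.rand

  val shuffled = df.orderBy(rand())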

Re: Multilabel classification using logistic regression

2015-05-27 Thread Peter Garbers
(Vectors.dense(query.features))) Thanks Peter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-classification-using-logistic-regression-tp23054.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
filtering. I'll rethink maintaining each CSV file as a single DF. Thanks, Peter On 10 May 2015 at 15:51, ayan guha guha.a...@gmail.com wrote: How did you end up with thousands of df? Are you using streaming? In that case you can do foreachRDD and keep merging incoming rdds to single rdd

Re: [SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hm, thanks. Do you know what this setting means: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L1178 ? Thanks, Peter Rudenko On 2015-05-08 17:48, ayan guha wrote: From S3. As the dependency of df will be on s3. And because rdds

[SQL][Dataframe] Change data source after saveAsParquetFile

2015-05-08 Thread Peter Rudenko
Hi, I have the next question: val data = sc.textFile("s3:///") val df = data.toDF df.saveAsParquetFile("hdfs://") df.someAction(...) If during someAction some workers were to die, would the recomputation download files from s3 or from the hdfs parquet? Thanks, Peter Rudenko

Re: Multilabel Classification in spark

2015-05-06 Thread Peter Garbers
that there are not any in the spark-mllib core library. Am I missing something? Would someone point me in the right direction? Thanks! Peter -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Multilabel-Classification-in-spark-tp22775.html Sent from the Apache

[Ml][Dataframe] Ml pipeline dataframe repartitioning

2015-04-24 Thread Peter Rudenko
practice to handle partitions in dataframes with a lot of columns? Should I repartition manually after adding columns? What’s better/faster: applying 30 transformers, one for each numeric column, or combining these columns into 1 vector column and applying 1 transformer? Thanks, Peter Rudenko

Reading files from http server

2015-04-13 Thread Peter Rudenko
downloading them first to hdfs? Something like this: sc.textFile( "http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_{0-23}.gz"), so it will have 24 partitions. Thanks, Peter Rudenko
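
sc.textFile has no HTTP scheme, so one assumption-laden workaround is to parallelize the URL list and let each task stream and decompress its own file (no retries or stream closing, just a sketch; the URL pattern is the one from the question):

  import java.net.URL
  import java.util.zip.GZIPInputStream
  import scala.io.Source

  val urls  = (0 to 23).map(d => s"http://azuremlsampleexperiments.blob.core.windows.net/criteo/day_$d.gz")
  val lines = sc.parallelize(urls, urls.size).flatMap { u =>
    Source.fromInputStream(new GZIPInputStream(new URL(u).openStream())).getLines()
  }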

Re: From DataFrame to LabeledPoint

2015-04-02 Thread Peter Rudenko
Hi, try the next code: val labeledPoints: RDD[LabeledPoint] = features.zip(labels).map { case Row(feature1, feature2, ..., label) => LabeledPoint(label, Vectors.dense(feature1, feature2, ...)) } Thanks, Peter Rudenko On 2015-04-02 17:17, drarse wrote: Hello!, I have a questions since days ago

Python Example sql.py not working in version spark-1.3.0-bin-hadoop2.4

2015-03-27 Thread Peter Mac
error occurs when using bin/pyspark shell. from pyspark.sql import StructField Traceback (most recent call last): File stdin, line 1, in module ImportError: cannot import name StructField --- Any advice for resolving? Thanks in advance. Peter
