Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh wrote
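A minimal sketch of that post-write rename, done driver-side through the Hadoop FileSystem API instead of a Lambda (coalesce(1) assumed so there is a single part file; all paths and names here are hypothetical):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rename-demo").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# One partition -> one part-*.json file in the output directory.
df.coalesce(1).write.mode("overwrite").json("/tmp/report_out")

# Rename the part file via the driver JVM's Hadoop FileSystem handle
# (spark._jvm / spark._jsc are internal but widely used for this).
jvm = spark._jvm
Path = jvm.org.apache.hadoop.fs.Path
fs = Path("/tmp/report_out").getFileSystem(spark._jsc.hadoopConfiguration())
part = [s.getPath() for s in fs.listStatus(Path("/tmp/report_out"))
        if s.getPath().getName().startswith("part-")][0]
fs.rename(part, Path("/tmp/report_out/report.json"))
```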

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
sense. Question: what are some good methods, tools, for combining the parts into a single, well-named file? I imagine that is outside of the scope of PySpark, but any advice is welcome. Thank you, Marco. On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh wrote: > AWS S3, or Google gs are had

Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
I need. However, the filenames are something like: part-0-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json Now, I understand Spark's need to include the partition number in the filename. However, it sure would be nice to control the rest of the file name. Any advice? Please and thank you. Marco.

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer. Thank you. Seems like I need to get comfortable with the 'struct' and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote: > Hi, > > You could rearrange the DataFrame so that writing the
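For readers landing here, Enrico's "rearrange the DataFrame" idea looks roughly like this in PySpark (a sketch with made-up column names):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nested-json").getOrCreate()
df = spark.createDataFrame([(1, "alice", "NY")], ["id", "name", "city"])

# Each struct column serializes as a nested JSON object, so reshaping the
# DataFrame reshapes the JSON: {"id":1,"user":{"name":"alice","city":"NY"}}
nested = df.select("id", F.struct("name", "city").alias("user"))
nested.write.mode("overwrite").json("/tmp/custom_json_out")  # hypothetical path
```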

Write custom JSON from DataFrame in PySpark

2023-05-03 Thread Marco Costantini
rements (serializing other things). Any advice? Please and thank you, Marco.

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
night! Thanks for your help team, Marco. On Wed, Apr 26, 2023 at 6:21 AM Mich Talebzadeh wrote: > Indeed very valid points by Ayan. How is email going to handle 1000s of > records? As a solution architect I tend to replace Users by customers and > for each order there must be products sor

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
iteration (send email, send HTTP request, etc). Thanks Mich, Marco. On Tue, Apr 25, 2023 at 6:06 PM Mich Talebzadeh wrote: > Hi Marco, > > First thoughts. > > foreach() is an action operation that is to iterate/loop over each > element in the dataset, meaning cursor based. That

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, > > Let

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
achieves my goal by putting all of the 'orders' in a single Array column. Now my worry is: will this column become too large if there are a great many orders? Is there a limit? I have searched for documentation on such a limit but could not find any. I truly appreciate your help Mich and team, Marco
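For reference, the single-array-column shape being discussed can be built as below (a sketch with toy data). On the limit question: there is no fixed row-count cap on the array, but the whole array for one user is materialized inside a single row, so executor memory is the practical bound.
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-per-user").getOrCreate()
users = spark.createDataFrame([(1, "alice"), (2, "bob")], ["user_id", "name"])
orders = spark.createDataFrame(
    [(1, 101, 9.99), (1, 102, 5.00), (2, 103, 12.50)],
    ["user_id", "order_id", "amount"])

# One row per user, with all of that user's orders gathered into an array.
statements = (orders.groupBy("user_id")
              .agg(F.collect_list(F.struct("order_id", "amount")).alias("orders"))
              .join(users, "user_id"))
statements.show(truncate=False)
```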

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need his/her own list of Orders. Additionally, I need

What is the best way to organize a join within a foreach?

2023-04-24 Thread Marco Costantini
I have two tables: {users, orders}. In this example, let's say that for each 1 User in the users table, there are 10 Orders in the orders table. I have to use pyspark to generate a statement of Orders for each User. So, a single user will need

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
Hmm, I think I got what Jingnan means. The lambda function is x != i and i is not evaluated when the lambda function was defined. So the pipelined rdd is rdd.filter(lambda x: x != i).filter(lambda x: x != i), rather than having the values of i substituted. Does that make sense to you, Sean? On
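A minimal reproduction of the late-binding behaviour described above, with the usual default-argument fix:
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "closure-demo")

# Late binding: i is looked up when the filters actually run, so after the
# loop both filters test x != 2 and the value 1 survives.
rdd = sc.parallelize([0, 1, 2])
for i in [1, 2]:
    rdd = rdd.filter(lambda x: x != i)
print(sorted(rdd.collect()))   # [0, 1], not the intended [0]

# Fix: bind the current value of i at definition time via a default argument.
rdd2 = sc.parallelize([0, 1, 2])
for i in [1, 2]:
    rdd2 = rdd2.filter(lambda x, i=i: x != i)
print(sorted(rdd2.collect()))  # [0]
```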

Spark RDD + HBase: adoption trend

2021-01-20 Thread Marco Firrincieli
Hi, my name is Marco and I'm one of the developers behind  https://github.com/unicredit/hbase-rdd  a project we are currently reviewing for various reasons. We were basically wondering if RDD "is still a thing" nowadays (we see lots of usage for DataFrames or Datasets) and we're no

RDD filter in for loop gave strange results

2021-01-20 Thread Marco Wong
DD is [0, 2] Result is [0, 2] RDD is [0, 2] Filtered RDD is [0, 1] Result is [0, 1] ``` Thanks, Marco

Edge AI with Spark

2020-09-24 Thread Marco Sassarini
Hi, I'd like to know if Spark supports edge AI: can Spark run on physical devices such as mobile devices running Android/iOS? Best regards, Marco Sassarini Marco Sassarini Artificial Intelligence Department office: +39 0434 562 978 www.overit.it

Re: Exposing JIRA issue types at GitHub PRs

2019-06-13 Thread Marco Gaido
Hi Dongjoon, Thanks for the proposal! I like the idea. Maybe we can extend it to component too and to some jira labels such as correctness which may be worth to highlight in PRs too. My only concern is that in many cases JIRAs are created not very carefully so they may be incorrect at the moment

Re: testing frameworks

2019-02-04 Thread Marco Mistroni
Thanks Hichame, will follow up on that. Anyone on this list using the python version of spark-testing-base? seems there's support for DataFrame. thanks in advance and regards Marco On Sun, Feb 3, 2019 at 9:58 PM Hichame El Khalfi wrote: > Hi, > You can use pysparkling => https://g

Re: testing frameworks

2019-02-03 Thread Marco Mistroni
Hi sorry to resurrect this thread Any spark libraries for testing code in pyspark? the github code above seems related to Scala following links in the original threads (and also LMGFY) i found out pytest-spark · PyPI <https://pypi.org/project/pytest-spark/> w/kindest regards Marco

Re: How to debug Spark job

2018-09-08 Thread Marco Mistroni
Hi Might sound like dumb advice, but try to break apart your process. Sounds like you are doing ETL: start basic with just the E and T, and do the changes that result in issues. If no problem, add the load step. Enable spark logging so that you can post error messages to the list. I think you can have a look

Reading multiple files in Spark / which pattern to use

2018-07-12 Thread Marco Mistroni
Could anyone help out? kind regards marco

Re: spark-shell gets stuck in ACCEPTED state forever when ran in YARN client mode.

2018-07-08 Thread Marco Mistroni
You running on emr? You checked the emr logs? Was in similar situation where job was stuck in accepted and then it died... turned out to be an issue w/ my code when running with huge data. Perhaps try to reduce gradually the load til it works and then start from there? Not a huge help but I

Re: Error submitting Spark Job in yarn-cluster mode on EMR

2018-05-08 Thread Marco Mistroni
Did you by any chance leave a sparkSession.setMaster("local") lurking in your code? Last time i checked, to run on yarn you have to package a 'fat jar'. could you make sure the spark dependencies in your jar match the version you are running on Yarn? alternatively please share code including
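A sketch of the safer pattern: build the session without any hardcoded master, so spark-submit/YARN stays in control (app name is made up):
```python
from pyspark.sql import SparkSession

# No .master(...) here: a leftover setMaster("local") in code silently
# overrides --master yarn given on the spark-submit command line.
spark = SparkSession.builder.appName("emr-job").getOrCreate()
```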

Re: RV: Unintelligible warning arose out of the blue.

2018-05-04 Thread Marco Mistroni
Hi, i think it has to do with spark configuration; dont think the standard configuration is geared up to be running in local mode on windows. Your dataframe is ok; you can check that you have read it successfully by printing out df.count() and you will see your code is reading the dataframe

Re: Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-05-02 Thread Marco Mistroni
> messages if you don't have the correct permissions. > > On Tue, Apr 24, 2018, 2:28 PM Marco Mistroni <mmistr...@gmail.com> wrote: > >> HI all >> i am using the following code for persisting data into S3 (aws keys are >> already stored in the environment variab

Re: A naive ML question

2018-04-29 Thread Marco Mistroni
Maybe not necessarily what you want, but you could, based on trans attributes, find out initial state and end state and give it to a decision tree to figure out if, based on these attributes, you can predict final stage. Again, not what you asked, but an idea to use ml for your data? Kr On Sun,

Re: Dataframe vs dataset

2018-04-28 Thread Marco Mistroni
Imho... neither... I see datasets as typed df and therefore ds are enhanced df. Feel free to disagree.. Kr On Sat, Apr 28, 2018, 2:24 PM Michael Artz wrote: > Hi, > > I use Spark everyday and I have a good grip on the basics of Spark, so > this question isnt for myself. But

Problem in persisting file in S3 using Spark: xxx file does not exist Exception

2018-04-24 Thread Marco Mistroni
which seems bizarre? I have even tried to remove the coalesce, but still got the same exception. Could anyone help pls? kind regards marco

Re: Live Stream Code Reviews :)

2018-04-12 Thread Marco Mistroni
PST I believe... like last time. Works out 9pm bst & 10 pm cet if I m correct On Thu, Apr 12, 2018, 8:47 PM Matteo Olivi wrote: > Hi, > 11 am in which timezone? > > Il gio 12 apr 2018, 21:23 Holden Karau ha scritto: > >> Hi Y'all, >> >> If your

Re: Best active groups, forums or contacts for Spark ?

2018-01-26 Thread Marco Mistroni
Hi From personal experience... and I might be asking u an obvious question 1. Does it work in standalone (no cluster)? 2. Can u break down app in pieces and try to see at which step the code gets killed? 3. Have u had a look at spark gui to see if the executors go oom? I might be oversimplifying what

Re: good materiala to learn apache spark

2018-01-18 Thread Marco Mistroni
Jacek Laskowski on this mail list wrote a book which is available online. Hth On Jan 18, 2018 6:16 AM, "Manuel Sopena Ballesteros" < manuel...@garvan.org.au> wrote: > Dear Spark community, > > > > I would like to learn more about apache spark. I have a Horton works HDP > platform and have

Re: Please Help with DecisionTree/FeatureIndexer

2017-12-16 Thread Marco Mistroni
.setOutputCol("indexedFeatures") .setMaxCategories(5) // features with > 4 distinct values are treated as continuous. .fit(transformedData) ? Apologies for the basic question but last time i worked on an ML project i was using Spark 1.x kr marco On Dec 16, 2017 1:24 PM,

Please Help with DecisionTree/FeatureIndexer

2017-12-15 Thread Marco Mistroni
spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40) at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118) what am i missing? w/kindest regards marco

How to control logging in testing package com.holdenkarau.spark.testing.

2017-12-13 Thread Marco Mistroni
org.apache.log4j.Level import org.apache.log4j.{ Level, Logger } val rootLogger = Logger.getRootLogger() rootLogger.setLevel(Level.ERROR) Logger.getLogger("org").setLevel(Level.ERROR) Logger.getLogger("akka").setLevel(Level.ERROR) thanks and kr marco

Re: pyspark configuration with Juyter

2017-11-04 Thread Marco Mistroni
Hi probably not what u r looking for but if u get stuck with conda jupyter and spark, if u get an account @ community.cloudera you will enjoy jupyter and spark out of the box Gd luck and hth Kr On Nov 4, 2017 4:59 PM, "makoto" wrote: > I setup environment variables in

Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Marco Mistroni
a Docker container and run spark off that container... hth marco On Fri, Oct 20, 2017 at 5:57 PM, Aakash Basu <aakash.spark@gmail.com> wrote: > Hey Marco/Jagat, > > As I earlier informed you, that I've already done those basic checks and > permission changes. >

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
.map(word => (word, 1)) .reduceByKey(_ + _) It saves the results into more than one partition, like part-0, part-1. I want to collect all of them into one file. 2017-10-20 16:43 GMT+03:00 Marco Mistroni <mmistr...@gmail.com>: > Hi > Could
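In PySpark terms, the one-file variant of that word count is just a coalesce(1) before the save (a sketch; fine for small results, but it funnels the write through a single task):
```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "one-file-wordcount")
counts = (sc.textFile("Sample.txt")
          .flatMap(lambda line: line.split(" "))
          .map(lambda w: (w, 1))
          .reduceByKey(lambda a, b: a + b))

# coalesce(1) -> one partition -> a single part-00000 file in the directory.
counts.coalesce(1).saveAsTextFile("counts_out")  # hypothetical output dir
```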

Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Marco Mistroni
xterm instead of control panel Hth Marco On Oct 20, 2017 8:31 AM, "Aakash Basu" <aakash.spark@gmail.com> wrote: Hi all, I have Spark 2.1 installed in my laptop where I used to run all my programs. PySpark wasn't used for around 1 month, and after starting it now, I'm gettin

Re: Write to HDFS

2017-10-20 Thread Marco Mistroni
Hi Could you just create an rdd/df out of what you want to save and store it in hdfs? Hth On Oct 20, 2017 9:44 AM, "Uğur Sopaoğlu" wrote: > Hi all, > > In word count example, > > val textFile = sc.textFile("Sample.txt") > val counts = textFile.flatMap(line =>

Re: Database insert happening two times

2017-10-17 Thread Marco Mistroni
Hi Uh if the problem is really with parallel exec u can try to call repartition(1) before u save Alternatively try to store data in a csv file and see if u have same behaviour, to exclude dynamodb issues Also ..are the multiple rows being written dupes (they have all same fields/values)? Hth On

Re: Quick one... AWS SDK version?

2017-10-07 Thread Marco Mistroni
Hi JG out of curiosity what's ur usecase? are you writing to S3? you could use Spark to do that , e.g using hadoop package org.apache.hadoop:hadoop-aws:2.7.1 ..that will download the aws client which is in line with hadoop 2.7.1? hth marco On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly

RE: Spark 2.2.0 Win 7 64 bits Exception while deleting Spark temp dir

2017-10-04 Thread Marco Mistroni
Hi Got similar issues on win 10. It has to do imho with the way permissions are setup in windows. That should not prevent ur program from getting back a result.. Kr On Oct 3, 2017 9:42 PM, "JG Perrin" wrote: > do you have a little more to share with us? > > > > maybe

Re: NullPointerException error while saving Scala Dataframe to HBase

2017-10-01 Thread Marco Mistroni
context you are using? Hth Marco On Oct 1, 2017 4:33 AM, <mailford...@gmail.com> wrote: Hi Guys- am not sure whether the email is reaching to the community members. Please can somebody acknowledge Sent from my iPhone > On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh <mailford...@gmai

Re: PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-16 Thread Marco Mistroni
> > This is what you want to do? > > On Fri, Sep 15, 2017 at 4:21 AM, Marco Mistroni <mmistr...@gmail.com> > wrote: > >> HI all >> could anyone assist pls? >> i am trying to flatMap a DataSet[(String, String)] and i am getting >> errors in Eclipse >

PLs assist: trying to FlatMap a DataSet / partially OT

2017-09-14 Thread Marco Mistroni
eter type what am i missing? or perhaps i am using the wrong approach? w/kindest regards Marco

Re: [Meetup] Apache Spark and Ignite for IoT scenarios

2017-09-07 Thread Marco Mistroni
Hi Will there be a podcast to view afterwards for remote EMEA users? Kr On Sep 7, 2017 12:15 AM, "Denis Magda" wrote: > Folks, > > Those who are craving for mind food this weekend come over the meetup - > Santa Clara, Sept 9, 9.30 AM: >

Re: SPARK Issue in Standalone cluster

2017-08-06 Thread Marco Mistroni
ns, sqlContext) logger.info('Out of here..') ## On Sat, Aug 5, 2017 at 9:09 PM, Marco Mistroni <mmistr...@gmail.com> wrote: > Uh believe me there are lots of ppl on this list who will send u code > snippets if u ask...  > > Yes that is what Steve po

Re: SPARK Issue in Standalone cluster

2017-08-05 Thread Marco Mistroni
an account there (though I guess you'll get there before me.) Try that out and let me know if u get stuck Kr On Aug 5, 2017 8:40 PM, "Gourav Sengupta" <gourav.sengu...@gmail.com> wrote: > Hi Marco, > > For the first time in several years FOR THE VERY FIRST TIME. I am

Re: SPARK Issue in Standalone cluster

2017-08-03 Thread Marco Mistroni
Hello my 2 cents here, hope it helps. If you want to just play around with Spark, i'd leave Hadoop out; it's an unnecessary dependency that you dont need for just running a python script. Instead do the following: - go to the root of your master / slave node. create a directory /root/pyscripts -

Re: problem initiating spark context with pyspark

2017-06-10 Thread Marco Mistroni
On Thu, Jun 8, 2017 at 8:38 PM, Marco Mistroni <mmistr...@gmail.com> > wrote: > >> try this link >> >> http://letstalkspark.blogspot.co.uk/2016/02/getting-started- >> with-spark-on-window-64.html >> >> it helped me when i had similar problems with

Re: problem initiating spark context with pyspark

2017-06-08 Thread Marco Mistroni
try this link http://letstalkspark.blogspot.co.uk/2016/02/getting-started-with-spark-on-window-64.html it helped me when i had similar problems with windows... hth On Wed, Jun 7, 2017 at 3:46 PM, Curtis Burkhalter < curtisburkhal...@gmail.com> wrote: > Thanks Doc I saw this on another

Sampling data on RDD vs sampling data on Dataframes

2017-05-21 Thread Marco Didonna
that I can get an rdd from a dataframe, perform sampleByKeyExact and then convert the RDD back to a dataframe. I'd really like to avoid such conversion, if possible. Thank you for any help you people can give :) Best, Marco

Re: Spark Testing Library Discussion

2017-04-26 Thread Marco Mistroni
Uh i stayed online in the other link but nobody joined... Will follow transcript Kr On 26 Apr 2017 9:35 am, "Holden Karau" wrote: > And the recording of our discussion is at https://www.youtube.com/ > watch?v=2q0uAldCQ8M > A few of us have follow up things and we will try

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
1.7.5 On 28 Mar 2017 10:10 pm, "Anahita Talebi" <anahita.t.am...@gmail.com> wrote: > Hi, > > Thanks for your answer. > What is the version of "org.slf4j" % "slf4j-api" in your sbt file? > I think the problem might come from this part. > >

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
ergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) => > { > case PathList("javax", "servlet", xs @ _*) => > MergeStrategy.first > case PathList(ps @ _*) if ps.last endsWith ".html" => > MergeStrategy.first > case "application.conf"

Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Marco Mistroni
Hello that looks to me like there's something dodgy with your Scala installation. Though Spark 2.0 is built on Scala 2.11, it still supports 2.10... i suggest you change one thing at a time in your sbt. First Spark version: run it and see if it works. Then amend the scala version hth marco On Tue

Re:

2017-03-09 Thread Marco Mistroni
Try to remove the Kafka code as it seems Kafka is not the issue here. Create a DS and save to Cassandra and see what happens... Even in the console. That should give u a starting point? Hth On 9 Mar 2017 3:07 am, "sathyanarayanan mudhaliyar" < sathyanarayananmudhali...@gmail.com> wrote:

Re: question on transforms for spark 2.0 dataset

2017-03-01 Thread Marco Mistroni
Hi I think u need a UDF if u want to transform a column Hth On 1 Mar 2017 4:22 pm, "Bill Schwanitz" wrote: > Hi all, > > I'm fairly new to spark and scala so bear with me. > > I'm working with a dataset containing a set of column / fields. The data > is stored in hdfs as
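A minimal sketch of such a column-transforming UDF (toy data, hypothetical names):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])

# Wrap an ordinary Python function as a UDF and apply it to one column.
shout = F.udf(lambda s: s.upper() if s is not None else None, StringType())
df.withColumn("name_upper", shout(F.col("name"))).show()
```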

Re: error in kafka producer

2017-02-28 Thread Marco Mistroni
This exception coming from a Spark program? could you share few lines of code ? kr marco On Tue, Feb 28, 2017 at 10:23 PM, shyla deshpande <deshpandesh...@gmail.com> wrote: > producer send callback exception: > org.apache.kafka.common.errors.TimeoutException: > Expi

Re: Run spark machine learning example on Yarn failed

2017-02-28 Thread Marco Mistroni
Or place the file in s3 and provide the s3 path Kr On 28 Feb 2017 1:18 am, "Yunjie Ji" wrote: > After start the dfs, yarn and spark, I run these code under the root > directory of spark on my master host: > `MASTER=yarn ./bin/run-example ml.LogisticRegressionExample >

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-26 Thread Marco Mistroni
similar setup can be used on Linux) https://spark.apache.org/docs/latest/streaming-kafka-integration.html kr On Sat, Feb 25, 2017 at 11:12 PM, Marco Mistroni <mmistr...@gmail.com> wrote: > Hi I have a look. At GitHub project tomorrow and let u know. U have a py > scripts to run and

Re: No main class set in JAR; please specify one with --class and java.lang.ClassNotFoundException

2017-02-25 Thread Marco Mistroni
Try to use --packages to include the jars. From error it seems it's looking for main class in jars but u r running a python script... On 25 Feb 2017 10:36 pm, "Raymond Xie" wrote: That's right Anahita, however, the class name is not indicated in the original github

Re: care to share latest pom forspark scala applications eclipse?

2017-02-24 Thread Marco Mistroni
Hi i am using sbt to generate eclipse project files. these are my dependencies; they'll probably translate to something like this in mvn dependencies. these are the same for all packages listed below: org.apache.spark 2.1.0: spark-core_2.11, spark-streaming_2.11, spark-mllib_2.11, spark-sql_2.11

Error when trying to filter

2017-02-21 Thread Marco Mans
; }); allData.show(); I get this error on the executors: java.lang.NoSuchMethodError: org.apache.commons.lang3.time.FastDateFormat.parse(Ljava/lang/String;)Ljava/util/Date; I'm running spark 2.0.0.cloudera1 Does anyone know why this error occurs? Regards, Marco

Basic Grouping Question

2017-02-20 Thread Marco Mans
the map and reduce function. I have the feeling that I am running in the wrong direction. Does anyone know how to approach this? (I hope I explained it right, so it can be understood :)) Regards, Marco

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-06 Thread Marco Mistroni
he spark connectors > have the appropriate transitive dependency on the correct version. > > On Sat, Feb 4, 2017 at 3:25 PM, Marco Mistroni <mmistr...@gmail.com> > wrote: > > Hi > > not sure if this will help at all, and pls take it with a pinch of salt >

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-04 Thread Marco Mistroni
t;group.id" -> "group1") val topics = List("testLogs").toSet val lines = KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent,

Re: Running a spark code on multiple machines using google cloud platform

2017-02-02 Thread Marco Mistroni
U can use EMR if u want to run on a cluster. Kr On 2 Feb 2017 12:30 pm, "Anahita Talebi" wrote: > Dear all, > > I am trying to run a spark code on multiple machines using submit job in > google cloud platform. > As the inputs of my code, I have a training and

Re: Surprised!!!!! Spark-shell showing inconsistent results

2017-02-02 Thread Marco Mistroni
Hi Have u tried to sort the results before comparing? On 2 Feb 2017 10:03 am, "Alex" wrote: > Hi As shown below same query when ran back to back showing inconsistent > results.. > > testtable1 is Avro Serde table... > > [image: Inline image 1] > > > > hc.sql("select *

Re: Hive Java UDF running on spark-sql issue

2017-02-01 Thread Marco Mistroni
Hi What is the UDF supposed to do? Are you trying to write a generic function to convert values to another type depending on what is the type of the original value? Kr On 1 Feb 2017 5:56 am, "Alex" wrote: Hi , we have Java Hive UDFS which are working perfectly fine in

Kafka dependencies in Eclipse project /Pls assist

2017-01-31 Thread Marco Mistroni
some dependencies clashing. Has anyone encountered a similar error? kr marco

converting timestamp column to a java.util.Date

2017-01-23 Thread Marco Mistroni
i have 1 timestamp column and a bunch of strings. i will need to convert that to something compatible with Mongo's ISODate kr marco

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-18 Thread Marco Mistroni
t.uri", "mongodb://localhost:27017/test.tree")) kr marco On Tue, Jan 17, 2017 at 7:53 AM, Marco Mistroni <mmistr...@gmail.com> wrote: > Uh. Many thanksWill try it out > > On 17 Jan 2017 6:47 am, "Palash Gupta" <spline_pal...@yahoo.com> wrote: >

Re: Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Marco Mistroni
Uh. Many thanks... Will try it out On 17 Jan 2017 6:47 am, "Palash Gupta" <spline_pal...@yahoo.com> wrote: > Hi Marco, > > What is the user and password you are using for mongodb connection? Did > you enable authorization? > > Better to include user & pass

Spark vs MongoDB: saving DataFrame to db raises missing database name exception

2017-01-16 Thread Marco Mistroni
hi all i have the following snippet which loads a dataframe from a csv file and tries to save it to mongodb. For some reason, the MongoSpark.save method raises the following exception: Exception in thread "main" java.lang.IllegalArgumentException: Missing database name. Set via the
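The fix that emerges later in the thread is to put the database and collection into the connector URI. A sketch against the mongo-spark 2.x connector (the URI is the one quoted in the thread; the CSV path is hypothetical):
```python
from pyspark.sql import SparkSession

# The connector reads the target db/collection from this URI, which is why
# omitting "test.tree" triggers "Missing database name".
spark = (SparkSession.builder
         .appName("mongo-save")
         .config("spark.mongodb.output.uri",
                 "mongodb://localhost:27017/test.tree")  # db "test", coll "tree"
         .getOrCreate())

df = spark.read.csv("tree.csv", header=True)  # hypothetical input file
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```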

Re: Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
sorry, should have done more research before jumping to the list. the version of the connector is 2.0.0, available from maven repos. sorry On Mon, Jan 16, 2017 at 9:32 PM, Marco Mistroni <mmistr...@gmail.com> wrote: > HI all > in searching on how to use Spark 2.0 with mongo i

Spark 2.0 vs MongoDb /Cannot find dependency using sbt

2017-01-16 Thread Marco Mistroni
HI all in searching on how to use Spark 2.0 with mongo i came across this link https://jira.mongodb.org/browse/SPARK-20 i amended my build.sbt (content below), however the mongodb dependency was not found Could anyone assist? kr marco name := "SparkExamples" version := "1.

Re: Importing a github project on sbt

2017-01-16 Thread Marco Mistroni
Uhm... Not a Spark issue... Anyway... Had similar issues with sbt. The quick sol. to get u going is to place ur dependency in your lib folder. The not-so-quick is to build the sbt dependency and do a sbt publish-local, or deploy local. But I consider both approaches hacks. Hth On 16 Jan 2017 2:00

Re: Running Spark on EMR

2017-01-15 Thread Marco Mistroni
ng Spark in standalone mode. > > Regards > > > ---- Original message > From: Marco Mistroni > Date:15/01/2017 16:34 (GMT+02:00) > To: User > Subject: Running Spark on EMR > > hi all > could anyone assist here? > i am trying to run spark 2.0.0 on an EMR c

Running Spark on EMR

2017-01-15 Thread Marco Mistroni
ow shall i build the spark session and how can i submit a python script to the cluster? kr marco

Re: Debugging a PythonException with no details

2017-01-14 Thread Marco Mistroni
It seems it has to do with UDF..Could u share snippet of code you are running? Kr On 14 Jan 2017 1:40 am, "Nicholas Chammas" wrote: > I’m looking for tips on how to debug a PythonException that’s very sparse > on details. The full exception is below, but the only

Re: backward compatibility

2017-01-10 Thread Marco Mistroni
I think old APIs are still supported but u r advised to migrate I migrated few apps from 1.6 to 2.0 with minimal changes Hth On 10 Jan 2017 4:14 pm, "pradeepbill" wrote: > hi there, I am using spark 1.4 code and now we plan to move to spark 2.0, > and > when I check

Re: Spark ML's RandomForestClassifier OOM

2017-01-10 Thread Marco Mistroni
You running locally? Found exactly same issue. 2 solutions: reduce data size, or run on EMR. Hth On 10 Jan 2017 10:07 am, "Julio Antonio Soto" wrote: > Hi, > > I am running into OOM problems while training a Spark ML > RandomForestClassifier (maxDepth of 30, 32 maxBins, 100

Re: Spark Python in Jupyter Notebook

2017-01-05 Thread Marco Mistroni
Hi might be off topic, but databricks has a web application in whicn you can use spark with jupyter. have a look at https://community.cloud.databricks.com kr On Thu, Jan 5, 2017 at 7:53 PM, Jon G wrote: > I don't use MapR but I use pyspark with jupyter, and this MapR

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2017-01-05 Thread Marco Mistroni
Hi If it only happens when u run 2 app at same time could it be that these 2 apps somehow run on same host? Kr On 5 Jan 2017 9:00 am, "Palash Gupta" <spline_pal...@yahoo.com> wrote: > Hi Marco and respected member, > > I have done all the possible things suggest

Re: Re: Re: Spark Streaming prediction

2017-01-03 Thread Marco Mistroni
by gathering data for x days , feed it to your model and see results Hth On Mon, Jan 2, 2017 at 9:51 PM, Daniela S <daniela_4...@gmx.at> wrote: > Dear Marco > > No problem, thank you very much for your help! > Yes, that is correct. I always know the minute values for the next e.g.

Re: Re: Spark Streaming prediction

2017-01-02 Thread Marco Mistroni
dashboard somewhere (via actors/ JMS or whatever mechanism) kr marco On Mon, Jan 2, 2017 at 8:26 PM, Daniela S <daniela_4...@gmx.at> wrote: > Hi > > Thank you very much for your answer! > > My problem is that I know the values for the next 2-3 hours in advance but >

Re: Spark Streaming prediction

2017-01-02 Thread Marco Mistroni
somewhere and have your dashboard poll periodically your data store to read the predictions. I have seen ppl on the list doing ML over a Spark streaming app, i m sure someone can reply back. Hopefully i gave u a starting point hth marco On 2 Jan 2017 4:03 pm, "Daniela S" <daniel

Re: Error when loading json to spark

2017-01-01 Thread Marco Mistroni
WithSchema = sqlContext.jsonRDD(jsonRdd, schema) But somehow i seem to remember that there was a way, in Spark 2.0, so that Spark will infer the schema for you.. hth marco On Sun, Jan 1, 2017 at 12:40 PM, Raymond Xie <xie3208...@gmail.com> wrote: > I found the cause: > > I ne
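The Spark 2.0 route being half-remembered here is schema inference on read; a sketch:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-load").getOrCreate()

# spark.read.json scans the input and infers the schema itself;
# sqlContext.jsonRDD(rdd, schema) is the older 1.x API quoted above.
df = spark.read.json("/tmp/data.json")  # hypothetical path
df.printSchema()
```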

Re: [ML] Converting ml.DenseVector to mllib.Vector

2016-12-31 Thread Marco Mistroni
and found this, maybe it'll help https://community.hortonworks.com/questions/33375/how-to-convert-a-dataframe-to-a-vectordense-in-sca.html hth marco On Sat, Dec 31, 2016 at 4:24 AM, Jason Wolosonovich <jmwol...@asu.edu> wrote: > Hello All, > > I'm working through the Data Science
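Alongside that link, Spark >= 2.0 ships a direct converter in PySpark; a sketch (with a toArray() fallback in case your version lacks fromML):
```python
from pyspark.ml.linalg import Vectors as MLVectors
from pyspark.mllib.linalg import Vectors as MLlibVectors

ml_vec = MLVectors.dense([1.0, 2.0, 3.0])
mllib_vec = MLlibVectors.fromML(ml_vec)          # direct ml -> mllib conversion
fallback = MLlibVectors.dense(ml_vec.toArray())  # rebuild from raw values
print(mllib_vec, fallback)
```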

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Marco Mistroni
in a standalone app works fine. Then what you can try is to do exactly the same processig you are doing but instead of loading csv files from HDFS you can load from local directory and see if the problem persists..(this just to exclude any issues with loading HDFS data.) hth Marco

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-30 Thread Marco Mistroni
ctionality" is to reduce > scope of your application's functionality so that you can isolate the issue > in certain part(s) of the app...I do not think he meant "reduce" operation > :) > > On Fri, Dec 30, 2016 at 9:26 PM, Palash Gupta <spline_pal...@yahoo.com. >

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
to trim down the code to a few lines that can reproduce the error. That will be a great start. Sorry for not being of much help hth marco On Thu, Dec 29, 2016 at 12:00 PM, Palash Gupta <spline_pal...@yahoo.com> wrote: > Hi Marco, > > Thanks for your response. > > Yes I

Re: [TorrentBroadcast] Pyspark Application terminated saying "Failed to get broadcast_1_ piece0 of broadcast_1 in Spark 2.0.0"

2016-12-29 Thread Marco Mistroni
Hi Pls try to read a CSV from filesystem instead of hadoop. If you can read it successfully then your hadoop file is the issue and you can start debugging from there. Hth On 29 Dec 2016 6:26 am, "Palash Gupta" wrote: > Hi Apache Spark User team, > > > >

Re: [Spark Core] - Spark dynamoDB integration

2016-12-12 Thread Marco Mistroni
Hi If it can help: 1. Check Java docs of when that method was introduced 2. U building a fat jar? Check which libraries have been included... some other dependencies might have forced an old copy to be included 3. If u take code outside spark... does it work successfully? 4. Send short

Re: Random Forest hangs without trace of error

2016-12-11 Thread Marco Mistroni
> I hope to be able to provide a good repro case in some weeks. If the > problem was in our own code I will also post it in this thread. > > Morten > > Den 10. dec. 2016 kl. 23.25 skrev Marco Mistroni <mmistr...@gmail.com>: > > Hello Morten > ok. > afaik there is a ti

Re: Random Forest hangs without trace of error

2016-12-10 Thread Marco Mistroni
found when playing around with RDF and decision trees and other ML algorithms If RDF is not a must for your usecase, could you try 'scale back' to Decision Trees and see if you still get intermittent failures? this at least to exclude issues with the data hth marco On Sat, Dec 10, 2016 at 5:20

Re: Random Forest hangs without trace of error

2016-12-10 Thread Marco Mistroni
Hi Bring back samples to 1k range to debug... or as suggested reduce tree and bins. had rdd running on same size data with no issues... or send me some sample code and data and I try it out on my ec2 instance ... Kr On 10 Dec 2016 3:16 am, "Md. Rezaul Karim"

Re: unit testing in spark

2016-12-09 Thread Marco Mistroni
Me too, as I spent most of my time writing unit/integ tests. pls advise on where I can start Kr On 9 Dec 2016 12:15 am, "Miguel Morales" wrote: > I would be interested in contributing. Ive created my own library for > this as well. In my blog post I talk about

Re: How to convert a unix timestamp column into date format(yyyy-MM-dd) ?

2016-12-04 Thread Marco Mistroni
Hi In python you can use datetime.fromtimestamp(..).strftime('%Y%m%d') Which spark API are you using? Kr On 5 Dec 2016 7:38 am, "Devi P.V" wrote: > Hi all, > > I have a dataframe like following, > > ++---+ >
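Alongside the plain-Python route, the DataFrame-native option is from_unixtime (a sketch with a made-up epoch value):
```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ts-to-date").getOrCreate()
df = spark.createDataFrame([(1480539568,)], ["ts"])

# from_unixtime renders an epoch-seconds column using the given pattern.
df.withColumn("day", F.from_unixtime(F.col("ts"), "yyyy-MM-dd")).show()
```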

Re: java.lang.Exception: Could not compute split, block input-0-1480539568000 not found

2016-12-01 Thread Marco Mistroni
(Long v1, Long v2) throws Exception { >>>> return v1+v2; >>>> } >>>> }).foreachRDD(new VoidFunction<JavaPairRDD<String, Long>>() { >>>> @Override >>>> public void call(JavaPairRDD<String, Long> >>>> stringIntege
