Re: Assigning a unique row ID

2017-04-07 Thread Subhash Sriram
Hi, We use monotonically_increasing_id() as well, but we cache the table first, as Ankur suggested. With that method, we get the same keys in all derived tables. Thanks, Subhash
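A minimal sketch of the caching approach described above, assuming a DataFrame named tableA (column names are illustrative):

```scala
import org.apache.spark.sql.functions.{col, monotonically_increasing_id}

// Add the ID column, then cache so the IDs are computed once and reused
// by every table derived from tableAWithId, instead of being regenerated
// each time the plan is recomputed.
val tableAWithId = tableA
  .withColumn("row_id", monotonically_increasing_id())
  .cache()

// Both derived tables now agree on row_id for the same source row.
val derived1 = tableAWithId.filter(col("amount") > 0)
val derived2 = tableAWithId.select(col("row_id"), col("amount"))
```

Note that cache() is lazy and is not a hard guarantee: the IDs are only fixed once an action has materialized the cached table, and if cached partitions are evicted and recomputed the values could in principle change.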

Re: Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi, Thanks, but that's using a random UUID. It's certainly unlikely to have collisions, but that's not guaranteed. I'd prefer something like monotonically_increasing_id or RDD's zipWithUniqueId, but with better behavioral characteristics -- so they don't surprise people when 2+ outputs derived from an …

Re: Assigning a unique row ID

2017-04-07 Thread Ankur Srivastava
You can use zipWithIndex, the approach Tim suggested, or even the one you are using, but I believe the issue is that tableA is being materialized every time you run the new transformations. Are you caching/persisting table A? If you do that, you should not see this behavior. Thanks, Ankur

Re: Assigning a unique row ID

2017-04-07 Thread Tim Smith
http://stackoverflow.com/questions/37231616/add-a-new-column-to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
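A sketch of the UUID-column approach the link points to, assuming a DataFrame named df (column name illustrative):

```scala
import java.util.UUID
import org.apache.spark.sql.functions.udf

// A non-deterministic UDF: every row gets a fresh random UUID.
val uuid = udf(() => UUID.randomUUID().toString)

val withUuid = df.withColumn("uuid", uuid())
```

As with monotonically_increasing_id, the column is regenerated if the plan is recomputed, so cache the result if derived tables must agree on the values.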

Assigning a unique row ID

2017-04-07 Thread Everett Anderson
Hi, What's the best way to assign a truly unique row ID (rather than a hash) to a DataFrame/Dataset? I originally thought that functions.monotonically_increasing_id would do this, but it seems to have a rather unfortunate property that if you add it as a column to table A and then derive tables …

BucketedRandomProjectionLSHModel algorithm details

2017-04-07 Thread vvinton
Hi There, Using spark-mllib_2.11-2.1.0. Facing an issue where BucketedRandomProjectionLSHModel.approxNearestNeighbors always returns one result. The dataset looks like: [truncated ASCII table; first column is id] …
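For reference, a minimal sketch of the API in question (Spark 2.1 ml; dataset is assumed to already have a vector column named features, and all parameter values are illustrative):

```scala
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)  // bucket width; too small a value leaves few candidates per bucket
  .setNumHashTables(3)   // more tables -> more candidates, higher recall
  .setInputCol("features")
  .setOutputCol("hashes")

val model = lsh.fit(dataset)

// Request the 5 approximate nearest neighbors of a key vector.
val key = Vectors.dense(1.0, 0.0)
val neighbors = model.approxNearestNeighbors(dataset, key, 5)
```

approxNearestNeighbors can return fewer rows than requested when not enough candidates hash into the key's buckets, so bucketLength and numHashTables are the first knobs to check when you always get a single result.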

Structured streaming and writing output to Cassandra

2017-04-07 Thread shyla deshpande
Is anyone using structured streaming and writing the results to a Cassandra database in a production environment? I do not think I have enough expertise to write a custom sink that can be used in a production environment. Please help!
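Until a first-class Cassandra sink exists, the usual workaround in Spark 2.x structured streaming is a ForeachWriter. A minimal, not production-grade sketch using the DataStax Java driver directly (host, keyspace, table, and column layout are all illustrative):

```scala
import com.datastax.driver.core.{Cluster, Session}
import org.apache.spark.sql.{ForeachWriter, Row}

class CassandraSink(host: String) extends ForeachWriter[Row] {
  private var cluster: Cluster = _
  private var session: Session = _

  // Called once per partition on the executor; return true to process rows.
  override def open(partitionId: Long, version: Long): Boolean = {
    cluster = Cluster.builder().addContactPoint(host).build()
    session = cluster.connect()
    true
  }

  // Illustrative schema: my_keyspace.my_table(id text, value double).
  override def process(row: Row): Unit = {
    session.execute(
      s"INSERT INTO my_keyspace.my_table (id, value) " +
      s"VALUES ('${row.getString(0)}', ${row.getDouble(1)})")
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}

// Usage: streamingDf.writeStream.foreach(new CassandraSink("cassandra-host")).start()
```

A production sink would additionally pool connections (opening a Cluster per partition per trigger is expensive), use prepared statements, and handle retries and idempotence, which is presumably the expertise gap the question is about.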

Re: Apache Drill vs Spark SQL

2017-04-07 Thread Pierce Lamb
Hi Kant, If you are interested in using Spark alongside a database to serve real-time queries, there are many options; almost every popular database has built some sort of connector to Spark. I've listed a majority of them and tried to delineate them in some way in this StackOverflow answer: …

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Definitely agree with Gourav there. I wouldn't want Jenkins to run my workflow; it seems to me that you would only be using Jenkins for its scheduling capabilities. Yes, you can run tests, but you wouldn't want it to run your orchestration of jobs. What happens if Jenkins goes down for any particular …

Contributed to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2 the query planner is heavily used throughout the Dataset code base. Are there any sites I can go to that explain the technical details, more than just from a high-level perspective?

Re: reducebykey

2017-04-07 Thread Ankur Srivastava
Hi Stephen, If you use aggregate functions or reduceGroups on KeyValueGroupedDataset, it behaves like reduceByKey on an RDD. Only if you use flatMapGroups or mapGroups does it behave like groupByKey on an RDD, and if you read the API documentation it warns about using those. Hope this helps. Thanks, Ankur
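A minimal sketch of the distinction, assuming a SparkSession named spark (the case class and data are illustrative):

```scala
import spark.implicits._

case class Sale(shop: String, amount: Double)
val sales = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 3.0)).toDS()

// reduceGroups combines values pairwise, like RDD reduceByKey.
val totals = sales
  .groupByKey(_.shop)
  .reduceGroups((a, b) => Sale(a.shop, a.amount + b.amount))

// mapGroups hands the whole group to your function, like RDD groupByKey,
// which is the variant the API docs warn about for large groups.
val totals2 = sales
  .groupByKey(_.shop)
  .mapGroups((shop, rows) => Sale(shop, rows.map(_.amount).sum))
```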

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Gourav Sengupta
Hi Steve, Why would you ever do that? You are suggesting the use of a CI tool as a workflow and orchestration engine. Regards, Gourav Sengupta

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Jörn Franke
Maybe using Ranger or Sentry would be the better choice to intercept those calls?

Re: Does Spark uses its own HDFS client?

2017-04-07 Thread Steve Loughran
On 7 Apr 2017, at 15:32, Alvaro Brandon wrote: > I was going through SparkContext.textFile() and I was wondering at what point Spark communicates with HDFS. Since when you download the Spark binaries you also specify the Hadoop …

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Steve Loughran
If you have Jenkins set up for some CI workflow, it can do scheduled builds and tests. Works well if you can do some build and test before even submitting it to a remote cluster.

small job runs out of memory using wholeTextFiles

2017-04-07 Thread Paul Tremblay
As part of my processing, I have the following code:

rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 10)
rdd.count()

The S3 directory has about 8 GB of data and 61,878 files. I am using Spark 2.1, running with 15 m3.xlarge nodes on EMR. The job fails with this error: …

Does Spark uses its own HDFS client?

2017-04-07 Thread Alvaro Brandon
I was going through SparkContext.textFile() and I was wondering at what point Spark communicates with HDFS. Since when you download the Spark binaries you also specify the Hadoop version you will use, I'm guessing it has its own client that calls HDFS wherever you specify it in the …

reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to DataFrames? Since switching over to Spark 2, I find myself increasingly dissatisfied with the idea of converting DataFrames to RDDs to do procedural programming on grouped data (both from an ease-of-programming stance and a performance stance). So I've been using …

Cant convert Dataset to case class with Option fields

2017-04-07 Thread Dirceu Semighini Filho
Hi Devs, I have some case classes here, and their fields are all optional: case class A(b: Option[B] = None, c: Option[C] = None, ...) If I read some data into a Dataset and try to convert it to this case class using the as method, it doesn't give me any answer; it simply freezes. If I change the case …
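A minimal sketch of the pattern being described, with illustrative class names and paths, assuming a SparkSession named spark (the report is that the as conversion hangs for nested optional fields like these):

```scala
import spark.implicits._

case class B(x: Option[Int] = None)
case class A(b: Option[B] = None, c: Option[String] = None)

// Conversion from a DataFrame whose schema matches A's fields.
val ds = spark.read.parquet("/path/to/table").as[A]
ds.show()
```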

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jörn Franke
How do you read them?

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
It's true that CrossValidator is not parallel currently -- see https://issues.apache.org/jira/browse/SPARK-19357 and feel free to help review.

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
- Limited the data to 100,000 records.
- 6 categorical features which go through imputation, string indexing, and one-hot encoding. The maximum number of classes for a feature is 100. As data is imputed it becomes dense.
- 1 numerical feature.
- Training Logistic Regression through …
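A sketch of that kind of pipeline on Spark 2.1 ml (all column names illustrative; the Imputer estimator only landed in Spark 2.2, so the imputation step is assumed to happen upstream):

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val catCols = Seq("cat1", "cat2", "cat3", "cat4", "cat5", "cat6")

// One StringIndexer + OneHotEncoder pair per categorical column.
val indexers = catCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
val encoders = catCols.map(c =>
  new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))

// Assemble the encoded categoricals plus the one numerical feature.
val assembler = new VectorAssembler()
  .setInputCols((catCols.map(_ + "_vec") :+ "num1").toArray)
  .setOutputCol("features")

val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val stages: Array[PipelineStage] =
  (indexers ++ encoders :+ assembler :+ lr).toArray
val model = new Pipeline().setStages(stages).fit(trainingData)
```

Each fit parallelizes across executors on its own, but CrossValidator in 2.1 fits the candidate models serially, as noted in SPARK-19357 above.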

Re: Spark 2.1 ml library scalability

2017-04-07 Thread Nick Pentreath
What is the size of the training data (number of examples, number of features)? Dense or sparse features? How many classes? What commands are you using to submit your job via spark-submit?

Re: distinct query getting stuck at ShuffleBlockFetcherIterator

2017-04-07 Thread Ramesh Krishnan
Hi Yash, Thank you for the response. Sorry, it was not at the distinct but at a join stage -- it was a self-join. There were no errors, and the job was stuck at that step for around 7 hours; the last message that came through was: *ShuffleBlockFetcherIterator: Started 4 remote fetches*. Thanks, Ramesh

Spark 2.1 ml library scalability

2017-04-07 Thread Aseem Bansal
When using Spark ML's LogisticRegression, RandomForest, CrossValidator, etc., do we need to give any consideration while coding to making it scale with more CPUs, or does it scale automatically? I am reading some data from S3 and using a pipeline to train a model. I am running the job on a Spark …

Is checkpointing in Spark Streaming Synchronous or Asynchronous?

2017-04-07 Thread kant kodali
Hi All, Is checkpointing in Spark Streaming synchronous or asynchronous? In other words, can Spark continue processing the stream while checkpointing? Thanks!

Re: Returning DataFrame for text file

2017-04-07 Thread Jacek Laskowski
Hi, What's the alternative? Dataset? You've got textFile then. It's an older API from the days when Dataset was merely experimental. Jacek On 29 Mar 2017 8:58 p.m., "George Obama" wrote: > I saw that the API, either R or Scala, we are returning DataFrame for …
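For context, a sketch of the two shapes on Spark 2.x (path illustrative): spark.read.text gives a DataFrame with a single value column, while spark.read.textFile gives a Dataset[String].

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("text-vs-textFile").getOrCreate()

val df: DataFrame = spark.read.text("/path/to/file.txt")           // column "value"
val ds: Dataset[String] = spark.read.textFile("/path/to/file.txt") // typed lines
```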

Re: reading snappy eventlog files from hdfs using spark

2017-04-07 Thread Jacek Laskowski
Hi, If your Spark app uses snappy in the code, define an appropriate library dependency to have it on the classpath. Don't rely on transitive dependencies. Jacek
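For example, in sbt this would be a direct dependency on the snappy codec (the version shown is illustrative; match it to your Spark distribution):

```scala
// build.sbt
libraryDependencies += "org.xerial.snappy" % "snappy-java" % "1.1.2.6"
```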

Re: Error while reading the CSV

2017-04-07 Thread Praneeth Gayam
Try the following:

spark-shell --master yarn-client --name nayan /opt/packages/-data-prepration/target/scala-2.10/-data-prepration-assembly-1.0.jar

On Thu, Apr 6, 2017 at 6:36 PM, nayan sharma wrote: > Hi All, > I am getting an error while loading a CSV file. …

Re: Hi

2017-04-07 Thread kant kodali
Oops, sorry. Please ignore this; wrong mailing list.

Hi

2017-04-07 Thread kant kodali
Hi All, I read the docs; however, I still have the following question: for stateful stream processing, is HDFS mandatory? In some places I see that it is required, and in other places I see that RocksDB can be used. I just want to know whether HDFS is mandatory for stateful stream processing. Thanks!

Re: What is the best way to run a scheduled spark batch job on AWS EC2 ?

2017-04-07 Thread Sam Elamin
Hi Shyla, You have multiple options really, some of which have already been listed, but let me try to clarify. Assuming you have a Spark application in a jar, you have a variety of options. You have to have an existing Spark cluster that is either running on EMR or somewhere else. *Super simple / …

Re: Error while reading the CSV

2017-04-07 Thread Yash Sharma
Sorry buddy, didn't get your question quite right. Just to test, I created a Scala class with spark-csv and it seemed to work. Don't know if that would help much, but here are the env details: EMR 2.7.3, scalaVersion := "2.11.8", Spark version 2.0.2.

Re: Error while reading the CSV

2017-04-07 Thread nayan sharma
Hi Yash, I know this will work perfectly, but here I wanted to read the CSV using the assembly jar file. Thanks, Nayan > On 07-Apr-2017, at 10:02 AM, Yash Sharma wrote: > Hi Nayan, I use --packages with the spark shell and spark-submit. Could you please try …

reading snappy eventlog files from hdfs using spark

2017-04-07 Thread satishl
Hi, I am planning to process Spark app event logs with another Spark app. These event logs are saved with snappy compression (extension: .snappy). When I read the file in a new Spark app, I get a "snappy library not found" error. I am confused as to how Spark can write the event log in snappy format …

Re: is there a way to persist the lineages generated by spark?

2017-04-07 Thread kant kodali
Yes, lineage that is actually replayable is what is needed for a validation process, so we can address questions like how a system arrived at a state S at a time T. I guess a good analogy is event sourcing.