RE: Strongly Connected Components

2016-11-11 Thread Shreya Agarwal
Thanks for the detailed response ☺ I will try the things you mentioned! From: Daniel Darabos [mailto:daniel.dara...@lynxanalytics.com] Sent: Friday, November 11, 2016 4:59 PM To: Shreya Agarwal Cc: Felix Cheung ; user@spark.apache.org; Denny Lee

Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
All right, thanks for the inputs. Is there any way Spark can process all combinations in parallel in one job? Is it OK to load the input CSV file into a DataFrame and use flatMap to create key pairs, then use reduceByKey to sum the double arrays? I believe that will work the same as the agg function which you
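
A minimal sketch of the flatMap-plus-reduceByKey idea described above, assuming an input DataFrame df with a double-array column; "doubleArray" is a placeholder column name and keysFor(...) is a hypothetical helper standing in for whatever emits the attribute-combination keys for a row:

    // Sketch only: `df`, the "doubleArray" column name, and keysFor(...) are assumptions.
    val pairs = df.rdd.flatMap { row =>
      val values = row.getAs[Seq[Double]]("doubleArray").toArray
      keysFor(row).map(key => (key, values))
    }

    val summed = pairs.reduceByKey { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }  // element-wise sum; arrays assumed equal length
    }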

Exception not failing Python applications (in yarn client mode) - SparkLauncher says app succeeded, where app actually has failed

2016-11-11 Thread Elkhan Dadashov
Hi, *Problem*: The Spark job fails, but the RM page says the job succeeded; also appHandle = sparkLauncher.startApplication() ... appHandle.getState() returns the FINISHED state - which indicates "The application finished with a successful status" - whereas the Spark job actually failed. *Environment*:
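
For context, a minimal sketch of the launcher pattern being described (org.apache.spark.launcher); the application path, master, and deploy mode are placeholders, not taken from the report:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Sketch only: paths and cluster settings are placeholders.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/app.py")
      .setMaster("yarn")
      .setDeployMode("client")
      .startApplication()

    // Poll until the handle reaches a final state; the report above is that a failed
    // Python app can still surface here as FINISHED rather than FAILED.
    while (!handle.getState.isFinal) Thread.sleep(1000)
    println(s"Final state: ${handle.getState}")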

SparkDriver memory calculation mismatch

2016-11-11 Thread Elkhan Dadashov
Hi, The Spark website lists the default Spark properties as follows. I did not override any properties in the spark-defaults.conf file, but when I launch Spark in YARN client mode: spark.driver.memory 1g spark.yarn.am.memory 512m
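
One way to see which values actually took effect, as a minimal sketch from spark-shell (assuming nothing is overridden in spark-defaults.conf):

    // Sketch only: prints the values the driver resolved, or "<not set>" if absent.
    println(sc.getConf.get("spark.driver.memory", "<not set>"))
    println(sc.getConf.get("spark.yarn.am.memory", "<not set>"))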

Re: Strongly Connected Components

2016-11-11 Thread Daniel Darabos
Hi Shreya, GraphFrames just calls the GraphX strongly connected components code. ( https://github.com/graphframes/graphframes/blob/release-0.2.0/src/main/scala/org/graphframes/lib/StronglyConnectedComponents.scala#L51 ) For choosing the number of iterations: If the number of iterations is less
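
For reference, a minimal sketch of the GraphFrames call being discussed, assuming an existing GraphFrame g; the iteration count shown is only a placeholder:

    import org.graphframes.GraphFrame

    // Sketch only: `g` is an assumed GraphFrame; 10 iterations is a placeholder value.
    val result = g.stronglyConnectedComponents
      .maxIter(10)
      .run()

    // `result` is the vertex DataFrame with a "component" column added.
    result.select("id", "component").show()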

pyspark: accept unicode column names in DataFrame.corr and cov

2016-11-11 Thread SamPenrose
The corr() and cov() methods of DataFrame require an instance of str for column names (https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1356), although instances of basestring appear to work for addressing columns:

Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread ayan guha
You can explore grouping sets in SQL and write an aggregate function to do the array-wise sum. It will boil down to something like: Select attr1, attr2, ..., yourAgg(val) From t Group by attr1, attr2, ... Grouping sets((attr1, attr2), (attr1)) On 12 Nov 2016 04:57, "Anil Langote"
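
A minimal sketch of that GROUPING SETS idea in Spark SQL, assuming a table t with grouping columns attr1/attr2 and a column val; array_sum_agg is a hypothetical stand-in for the user-defined aggregate the thread refers to:

    // Sketch only: `t`, attr1, attr2, val, and the array_sum_agg UDAF are assumptions.
    val rolledUp = spark.sql("""
      SELECT attr1, attr2, array_sum_agg(val) AS total
      FROM t
      GROUP BY attr1, attr2
      GROUPING SETS ((attr1, attr2), (attr1))
    """)
    rolledUp.show()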

appHandle.kill(), SparkSubmit Process, JVM questions related to SparkLauncher design and Spark Driver

2016-11-11 Thread Elkhan Dadashov
A few more questions for Marcelo. Sorry, Marcelo, for the very long question list. I'd really appreciate your kind help and answers to these questions in order to fully understand the design decisions and architecture you had in mind while implementing the very helpful SparkLauncher. *Scenario*: Spark job is

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
I really don't see why one would want to set up streaming replication except for situations where functionality similar to transactional databases is required in big data. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nicholas Sharkey
I did get *some* help from Databricks in terms of programmatically grabbing the categorical variables, but I can't figure out where to go from here: # Get all string cols/categorical cols stringColList = [i[0] for i in df.dtypes if i[1] == 'string'] # generate OHEs for every col in

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-11 Thread Cody Koeninger
It is already documented that you must use a different group id, which as far as I can tell you are still not doing. On Nov 10, 2016 7:43 PM, "Shixiong(Ryan) Zhu" wrote: > Yeah, the KafkaRDD cannot be reused. It's better to document it. > > On Thu, Nov 10, 2016 at 8:26
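
For reference, a minimal sketch of the documented setup for the spark-streaming-kafka-0-10 direct stream, with the point Cody is making shown in the consumer settings; broker, topic, and the ssc StreamingContext are placeholders/assumptions:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Sketch only: broker address, topic name, and `ssc` are assumed.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      // The point being made: each stream needs its own consumer group id.
      "group.id" -> "use_a_separate_group_id_for_each_stream",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams)
    )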

Re: Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nick Pentreath
For now OHE supports a single column, so you have to have 1000 OHEs in a pipeline. However, you can add them programmatically, so it is not too bad. If the cardinality of each feature is quite low, it should be workable. After that, use VectorAssembler to stitch the vectors together (which accepts
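
A minimal sketch of building the per-column stages programmatically and stitching them with VectorAssembler, assuming an input DataFrame df; the column-name suffixes are placeholders:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // Sketch only: `df` is an assumed input DataFrame.
    val stringCols = df.schema.fields.filter(_.dataType.typeName == "string").map(_.name)

    // One StringIndexer + OneHotEncoder pair per categorical column, built programmatically.
    val indexers: Array[PipelineStage] = stringCols.map(c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx"))
    val encoders: Array[PipelineStage] = stringCols.map(c =>
      new OneHotEncoder().setInputCol(c + "_idx").setOutputCol(c + "_vec"))

    // Stitch all the encoded vectors into a single features column.
    val assembler = new VectorAssembler()
      .setInputCols(stringCols.map(_ + "_vec"))
      .setOutputCol("features")

    val pipeline = new Pipeline().setStages(indexers ++ encoders :+ assembler)
    val encoded = pipeline.fit(df).transform(df)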

RE: Strongly Connected Components

2016-11-11 Thread Shreya Agarwal
Tried GraphFrames. Still faced the same issue - the job died after a few hours. The errors I see (and I see tons of them) are - (I ran with 3 times the partitions as well, which was 12 times the number of executors, but still the same.) - ERROR NativeAzureFileSystem:

Re: Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Aniket Bhatnagar
Hi Shreya, Initial partitions in the Datasets were more than 1000, and after a group by operation the resultant Dataset had only 200 partitions (because the default number of shuffle partitions is 200). Any further operations on the resultant Dataset will have a maximum parallelism of 200
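
A minimal sketch of the knob behind that 200, assuming two Datasets ds1/ds2 joined on a placeholder column "key": spark.sql.shuffle.partitions controls the number of post-shuffle partitions for joins and groupBy, so raising it avoids a separate repartition step.

    // Sketch only: `ds1`, `ds2`, and the "key" column are placeholders.
    spark.conf.set("spark.sql.shuffle.partitions", "1000")

    val joined  = ds1.join(ds2, "key")          // the shuffle now produces 1000 partitions
    val grouped = ds1.groupBy("key").count()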

Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread nsharkey
I have a dataset in which I need to convert some of the variables to dummy variables. The get_dummies function in Pandas works perfectly on smaller datasets but since it collects I'll always be bottlenecked by the master node. I've looked at Spark's OHE feature and while that will work in theory

RE: Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Shreya Agarwal
Curious – why do you want to repartition? Is there a subsequent step which fails because the number of partitions is too low? Or do you want to do it for a perf gain? Also, what were your initial Dataset partitions and how many did you have for the result of the join? From: Aniket Bhatnagar

Finding a Spark Equivalent for Pandas' get_dummies

2016-11-11 Thread Nicholas Sharkey
I have a dataset in which I need to convert some of the variables to dummy variables. The get_dummies function in Pandas works perfectly on smaller datasets but since it collects I'll always be bottlenecked by the master node. I've looked at Spark's OHE feature and while that will work in theory

Re: TallSkinnyQR

2016-11-11 Thread Iman Mohtashemi
Yes, this would be helpful; otherwise the Q part of the decomposition is useless. One can use it to solve the system by transposing Q and multiplying it with b, then solving for x in Ax = b, where A = R and b = Qt*b, since the upper triangular matrix (R) is correctly available. On Fri, Nov 11, 2016 at
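
For context, a minimal sketch of the decomposition this thread is about, assuming an existing tall-and-skinny RowMatrix mat; whether Q's rows come back in a usable order is exactly the issue being raised:

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Sketch only: `mat` is an assumed RowMatrix with many rows and few columns.
    val qr = mat.tallSkinnyQR(computeQ = true)
    val r = qr.R   // local upper-triangular factor, usable for back-substitution in Rx = Qt*b
    val q = qr.Q   // distributed RowMatrix; the thread's concern is the ordering of its rows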

DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
Hi All, I have been working on one use case and couldn't come up with a better solution. I have seen you are very active on the Spark user list, so please share your thoughts on the implementation. Below is the requirement. I have tried using a Dataset by splitting the double array column, but it fails

RDD to HDFS - Kerberos - authentication error - RetryInvocationHandler

2016-11-11 Thread Gerard Casey
Hi all, I have an RDD that I wish to write to HDFS. data.saveAsTextFile("hdfs://path/vertices") This returns: WARN RetryInvocationHandler: Exception while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over null. Not retrying because try once and fail.

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
I think it differs in that it starts streaming data through its own port as soon as the first block lands, so the granularity is a block. However, think of it as Oracle GoldenGate replication or SAP replication for databases. The only difference is that if the corruption is in the block, with HDFS it

Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Aniket Bhatnagar
Hi, I can't seem to find a way to pass the number of partitions while joining 2 Datasets or doing a groupBy operation on a Dataset. There is an option of repartitioning the resultant Dataset, but it's inefficient to repartition after the Dataset has been joined/grouped into the default number of partitions.

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
reason being ? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk. Any

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
Reason being you can set up HDFS duplication on your own to some other cluster. On Nov 11, 2016 22:42, "Mich Talebzadeh" wrote: > reason being ? > > Dr Mich Talebzadeh > > > > LinkedIn * >

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md has all the information about DataFrames / Spark SQL. On Fri, Nov 11, 2016 at 8:52 AM kant kodali wrote: > Wait I cannot create CassandraSQLContext from spark-shell. is this only > for
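
A minimal sketch of reading a Cassandra table through the connector's DataFrame source and then querying it with Spark SQL, assuming the connector package is on the spark-shell classpath and spark.cassandra.connection.host points at a reachable node; the keyspace/table names are placeholders:

    // Sketch only: keyspace, table, and connection settings are assumptions.
    val df = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_keyspace", "table" -> "my_table"))
      .load()

    df.createOrReplaceTempView("my_table")
    spark.sql("SELECT * FROM my_table LIMIT 10").show()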

Re: Possible DR solution

2016-11-11 Thread Deepak Sharma
This is a waste of money, I guess. On Nov 11, 2016 22:41, "Mich Talebzadeh" wrote: > starts at $4,000 per node per year all inclusive. > > With discount it can be halved but we are talking a node itself so if you > have 5 nodes in primary and 5 nodes in DR we are talking

Re: Possible DR solution

2016-11-11 Thread Mich Talebzadeh
Starts at $4,000 per node per year, all inclusive. With a discount it can be halved, but we are talking per node, so if you have 5 nodes in primary and 5 nodes in DR we are talking about $40K already. HTH Dr Mich Talebzadeh LinkedIn *

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Wait, I cannot create a CassandraSQLContext from spark-shell. Is this only for enterprise versions? Thanks! On Fri, Nov 11, 2016 at 8:14 AM, kant kodali wrote: > https://academy.datastax.com/courses/ds320-analytics- > apache-spark/spark-sql-spark-sql-basics > > On Fri, Nov 11,

RE: Possible DR solution

2016-11-11 Thread Mudit Kumar
Is it feasible cost wise? Thanks, Mudit From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com] Sent: Friday, November 11, 2016 2:56 PM To: user @spark Subject: Possible DR solution Hi, Has anyone had experience of using WanDisco block replication to create a fault

Kafka Producer within a docker Instance

2016-11-11 Thread Raghav
Hi, I run a Spark job where the executor is within a Docker instance. I want to push the Spark job output (one by one) to a Kafka broker which is outside the Docker instance. Has anyone tried anything like this, where the Kafka producer is within Docker and the broker is outside? I am a newbie to
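
A minimal sketch of one common pattern for pushing Spark output to an external broker, assuming an existing RDD of strings; the broker host, topic, and record type are placeholders. The networking point is that bootstrap.servers (and Kafka's advertised listeners) must resolve to an address reachable from inside the container.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Sketch only: `rdd`, broker address, and topic name are assumptions.
    rdd.foreachPartition { partition =>
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-host-outside-docker:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      // One producer per partition, created on the executor inside the container.
      val producer = new KafkaProducer[String, String](props)
      partition.foreach(record =>
        producer.send(new ProducerRecord[String, String]("output-topic", record)))
      producer.close()
    }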

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
https://academy.datastax.com/courses/ds320-analytics-apache-spark/spark-sql-spark-sql-basics On Fri, Nov 11, 2016 at 8:11 AM, kant kodali wrote: > Hi, > > This is spark-cassandra-connector > but I am looking > more for

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
Hi, This is the spark-cassandra-connector, but I am looking more for how to use Spark SQL and expose it as a JDBC server for Cassandra. Thanks! On Fri, Nov 11, 2016 at 8:07 AM, Yong Zhang wrote: > Read the document on

Re: How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread Yong Zhang
Read the document on https://github.com/datastax/spark-cassandra-connector Yong From: kant kodali Sent: Friday, November 11, 2016 11:04 AM To: user @spark Subject: How to use Spark SQL to connect to Cassandra from Spark-Shell? How to use

How to use Spark SQL to connect to Cassandra from Spark-Shell?

2016-11-11 Thread kant kodali
How to use Spark SQL to connect to Cassandra from Spark-Shell? Any examples? I use Java 8. Thanks! kant

Possible DR solution

2016-11-11 Thread Mich Talebzadeh
Hi, Has anyone had experience of using WanDisco block replication to create a fault-tolerant solution for DR in Hadoop? The product claims that it starts replicating as soon as the first data block lands on HDFS, taking the block and sending it to the DR/replica site.

load large number of files from s3

2016-11-11 Thread Shawn Wan
Hi, We have 30 million small files (100k each) on S3. I want to know how bad it is to load them directly from S3 (e.g. driver memory, IO, executor memory, S3 reliability) before merging or distcp'ing them. Does anybody have experience with this? Thanks in advance! Regards, Shawn

load large number of files from s3

2016-11-11 Thread Xiaomeng Wan
Hi, We have 30 million small files (100k each) on S3. I want to know how bad it is to load them directly from S3 (e.g. driver memory, IO, executor memory, S3 reliability) before merging or distcp'ing them. Does anybody have experience with this? Thanks in advance! Regards, Shawn
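
A minimal sketch of one common mitigation, assuming the files are plain text under a single prefix; the bucket/prefix, output format, and partition count are placeholders: read once, coalesce into far fewer partitions, and persist a consolidated copy so later jobs do not pay the per-file listing/open cost for 30 million objects.

    // Sketch only: paths, format, and the target partition count are assumptions.
    val raw = spark.read.text("s3a://my-bucket/small-files-prefix/*")

    raw.coalesce(2000)
       .write
       .mode("overwrite")
       .parquet("s3a://my-bucket/consolidated/")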

Re: TallSkinnyQR

2016-11-11 Thread Sean Owen
@Xiangrui / @Joseph, do you think it would be reasonable to have CoordinateMatrix sort the rows it creates when making an IndexedRowMatrix, in order to make the ultimate output of toRowMatrix less surprising when it's not ordered? On Tue, Nov 8, 2016 at 3:29 PM Sean Owen wrote:
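
A minimal sketch of the user-side workaround implied here, assuming an existing CoordinateMatrix coordMat: sort the indexed rows explicitly before dropping the indices, so the resulting RowMatrix comes back in row order.

    import org.apache.spark.mllib.linalg.distributed.{IndexedRowMatrix, RowMatrix}

    // Sketch only: `coordMat` is an assumed CoordinateMatrix.
    val indexed: IndexedRowMatrix = coordMat.toIndexedRowMatrix()
    val orderedRows = indexed.rows.sortBy(_.index).map(_.vector)  // restore row order
    val rowMat = new RowMatrix(orderedRows)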