Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mike Metzger
Hi Kevin - There's not really a race condition as the 64 bit value is split into a 31 bit partition id (the upper portion) and a 33 bit incrementing id. In other words, as long as each partition contains fewer than 8 billion entries there should be no overlap and there is not any
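A minimal sketch of the layout described above, assuming a local SparkSession (the app name, master setting, and partition count are arbitrary):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{monotonically_increasing_id, shiftRight}

val spark = SparkSession.builder.master("local[4]").appName("id-layout").getOrCreate()
import spark.implicits._

// Generate ids across 4 partitions, then split each id back into the two parts
// described above: partition id (upper 31 bits) and per-partition counter (lower 33 bits).
val df = spark.range(0, 8).repartition(4)
  .withColumn("id", monotonically_increasing_id())
  .withColumn("partition_part", shiftRight($"id", 33))
  .withColumn("record_part", $"id".bitwiseAND((1L << 33) - 1))

df.orderBy("id").show()
```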

Re: difference between package and jar Option in Spark

2016-09-04 Thread Tal Grynbaum
You need to download all the dependencies of that jar as well On Mon, Sep 5, 2016, 06:59 Divya Gehlot wrote: > Hi, > I am using spark-csv to parse my input files . > If I use --package option it works fine but if I download >
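For example, --packages resolves the artifact and its transitive dependencies from Maven automatically, while --jars ships only the jars you list, so every dependency jar must be supplied as well (artifact names and versions below are illustrative):

```
# transitive dependencies resolved automatically:
spark-shell --packages com.databricks:spark-csv_2.10:1.4.0

# with --jars, each dependency jar has to be listed explicitly:
spark-shell --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar,univocity-parsers-1.5.1.jar
```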

Re: difference between package and jar Option in Spark

2016-09-04 Thread Divya Gehlot
Hi, I am using spark-csv to parse my input files. If I use the --packages option it works fine, but if I download the jar and use the --jars option it throws a ClassNotFoundException. Thanks, Divya On 1 September 2016 at 17:26,

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-04 Thread Holden Karau
You really shouldn't mix different versions of Spark between the master and worker nodes; if you're going to upgrade, upgrade all of them. Otherwise you may get very confusing failures. On Monday, September 5, 2016, Rex X wrote: > Wish to use the Pivot Table feature of data

Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread 刘虓
Hi, I think you can refer to the Spark history server to figure out how the time was spent. 2016-09-05 10:36 GMT+08:00 xiefeng : > The spark context will be reused, so the spark context initialization won't > affect the throughput test. > > > > -- > View this message in

Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
The spark context will be reused, so the spark context initialization won't affect the throughput test. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Why-does-spark-take-so-much-time-for-simple-task-without-calculation-tp27628p27657.html Sent from the

Re: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread xiefeng
My detailed test process: 1. In initialization, it will create 100 string RDDs and distribute them in spark workers. for (int i = 1; i <= numOfRDDs; i++) { JavaRDD<String> rddData = sc.parallelize(Arrays.asList(Integer.toString(i))).coalesce(1);

RE: Why does spark take so much time for simple task without calculation?

2016-09-04 Thread Xie, Feng
Hi Aliaksandr, Thank you very much for your answer. In my test I reuse the spark context: it is initialized when I start the application, and for the later throughput test it won't be initialized again. And when I increase the number of workers, the throughput doesn't increase. I read

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi Mich, Thank you for your input. Does monotonically_increasing_id guard against race conditions, or can it produce duplicate ids at some point with multiple threads, multiple instances, ... ? Even System.currentTimeMillis() still has duplication? Cheers, Kevin. On Mon, Sep 5, 2016 at 12:30 AM, Mich

Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
Hi Russell, if possible please help me to solve the below issue. val df = sqlContext.read. format("org.apache.spark.sql.cassandra"). options(Map("c_table"->"restt","keyspace"->"sss")). load() com.datastax.driver.core.TransportException: [/192.23.2.100:9042] Cannot connect at

Problem in accessing swebhdfs

2016-09-04 Thread Sourav Mazumder
Hi, When I try to access a swebhdfs URI I get the following error. In my hadoop cluster webhdfs is enabled, and I can access the same resource using the webhdfs API from an HTTP client with SSL. Any idea what is going wrong? Regards, Sourav java.io.IOException: Unexpected HTTP response: code=404 !=

Reuters Market Data System connection to Spark Streaming

2016-09-04 Thread Mich Talebzadeh
Hi, Has anyone had experience of using such messaging system like Kafka to connect Reuters Market Data System to Spark Streaming by any chance. Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Resources for learning Spark administration

2016-09-04 Thread Mich Talebzadeh
Hi, There is a lot to cover here depending on the business and your needs. Do you mean: 1. Hardware spec for Spark master and nodes 2. The number of nodes, how to scale the nodes 3. Where to set up Spark nodes, on the same hardware nodes as HDFS (assuming using Hadoop) or

Resources for learning Spark administration

2016-09-04 Thread Somasundaram Sekar
Please suggest some good resources to learn Spark administration.

Re: S3A + EMR failure when writing Parquet?

2016-09-04 Thread Everett Anderson
Hey, Thanks for the reply and sorry for the late response! I haven't been able to figure out the root cause, but I have been able to get things working if both the cluster and the remote submitter use S3A instead of EMRFS for all s3:// interactions, so I'm going with that, for now. My
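A minimal sketch of that setup, assuming the hadoop-aws S3A client is on the classpath; the bucket name and credential values are placeholders:

```scala
// Configure S3A credentials and address the data with s3a:// URIs so EMRFS is bypassed.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

val df = sqlContext.read.parquet("s3a://my-bucket/input/")
df.write.parquet("s3a://my-bucket/output/")
```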

Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
This would also be a better question for the SCC user list :) https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user On Sun, Sep 4, 2016 at 9:31 AM Russell Spitzer wrote: > >

Re: spark cassandra issue

2016-09-04 Thread Russell Spitzer
https://github.com/datastax/spark-cassandra-connector/blob/v1.3.1/doc/14_data_frames.md In Spark 1.3 it was illegal to use "table" as a key in Spark SQL so in that version of Spark the connector needed to use the option "c_table" val df = sqlContext.read. |
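A sketch based on the snippet earlier in the thread (connector 1.3.x against Spark 1.3), showing the "c_table" option used in place of "table":

```scala
// In connector 1.3.x the table name is passed as "c_table" because "table"
// was a reserved option key in Spark SQL at the time.
val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("c_table" -> "restt", "keyspace" -> "sss"))
  .load()
```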

Re: Spark transformations

2016-09-04 Thread janardhan shetty
In scala Spark ML Dataframes. On Sun, Sep 4, 2016 at 9:16 AM, Somasundaram Sekar < somasundar.se...@tigeranalytics.com> wrote: > Can you try this > > https://www.linkedin.com/pulse/hive-functions-udfudaf- > udtf-examples-gaurav-singh > > On 4 Sep 2016 9:38 pm, "janardhan shetty"

Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-04 Thread Rex X
Wish to use the Pivot Table feature of data frame which is available since Spark 1.6. But the spark of current cluster is version 1.5. Can we install Spark 2.0 on the master node to work around this? Thanks!

Re: Spark transformations

2016-09-04 Thread Somasundaram Sekar
Can you try this https://www.linkedin.com/pulse/hive-functions-udfudaf-udtf-examples-gaurav-singh On 4 Sep 2016 9:38 pm, "janardhan shetty" wrote: > Hi, > > Is there any chance that we can send entire multiple columns to an udf and > generate a new column for Spark ML.

Spark transformations

2016-09-04 Thread janardhan shetty
Hi, Is there any chance that we can send entire multiple columns to a udf and generate a new column for Spark ML? I see a similar approach in VectorAssembler but am not able to use a few classes/traits like HasInputCols, HasOutputCol, DefaultParamsWritable since they are private. Any leads/examples is
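A minimal sketch of the idea being asked about: a plain UDF can take several input columns and emit one new column without touching the private ML traits (the DataFrame df and the column names are hypothetical):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Combine three numeric columns into one new column via a multi-argument UDF.
val combine = udf((a: Double, b: Double, c: Double) => a + b + c)
val result = df.withColumn("combined", combine(col("colA"), col("colB"), col("colC")))
```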

Re: spark cassandra issue

2016-09-04 Thread Mich Talebzadeh
and your Cassandra table is there etc? Dr Mich Talebzadeh

Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
Hey Mich, I am using the same one right now. Thanks for the reply. import org.apache.spark.sql.cassandra._ import com.datastax.spark.connector._ //Loads implicit functions sc.cassandraTable("keyspace name", "table name") On Sun, Sep 4, 2016 at 8:48 PM, Mich Talebzadeh

Re: spark cassandra issue

2016-09-04 Thread Mich Talebzadeh
Hi Selvan. I don't deal with Cassandra but have you tried other options as described here https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md To get a Spark RDD that represents a Cassandra table, call the cassandraTable method on the SparkContext object. import

Re: spark cassandra issue

2016-09-04 Thread Selvam Raman
It's very urgent, please help me guys. On Sun, Sep 4, 2016 at 8:05 PM, Selvam Raman wrote: > Please help me to solve the issue. > > spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 > --conf spark.cassandra.connection.host=** > > val df =

Generating random Data using Spark and saving it to table, views appreciated

2016-09-04 Thread Mich Talebzadeh
Hi All, The following code creates an array of certain rows in Spark and saves the output into a Hive ORC table. You can save it in whatever format you prefer. I wanted to create generic test data in Spark. It is not something standard, but I had a similar approach for Oracle. It is cooked-up stuff
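Not the code from the thread, but a minimal sketch of the approach described: build random rows, make a DataFrame, and save it as a Hive ORC table (the table name test.random_data is a placeholder):

```scala
import org.apache.spark.sql.SaveMode
import scala.util.Random

// Generate 100 rows of random test data and persist them as an ORC-backed Hive table.
val rows = (1 to 100).map(i => (i, Random.nextInt(1000), Random.alphanumeric.take(10).mkString))
val df = spark.createDataFrame(rows).toDF("id", "rand_int", "rand_str")
df.write.mode(SaveMode.Overwrite).format("orc").saveAsTable("test.random_data")
```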

spark cassandra issue

2016-09-04 Thread Selvam Raman
Please help me to solve the issue. spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.10:1.3.0 --conf spark.cassandra.connection.host=** val df = sqlContext.read. | format("org.apache.spark.sql.cassandra"). | options(Map( "table" -> "", "keyspace" -> "***")).

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Mich Talebzadeh
You can create a monotonically incrementing ID column on your table scala> val ll_18740868 = spark.table("accounts.ll_18740868") scala> val startval = 1 scala> val df = ll_18740868.withColumn("id", monotonically_increasing_id() + startval).show(2)

Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi everyone, Please give me your opinions on what is the best ID generator for an ID field in parquet? UUID.randomUUID(); AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis()); AtomicLong counter = new AtomicLong(0); Thanks, Kevin.

How does chaining of Windowed Dstreams work?

2016-09-04 Thread Hemalatha A
Hello, I have a set of DStreams on which I'm performing some computation; each DStream is windowed on another, starting from the base stream, based on the order of window intervals. I want to find out the best stream on which to window a particular stream. Suppose, I have a spark DStream,
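A hypothetical sketch of such chaining: each windowed DStream's window and slide durations must be multiples of its parent's slide duration (the socket source and the durations are placeholders):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval 5s; w1 is windowed on the base stream, w2 is windowed on w1.
val ssc = new StreamingContext(sc, Seconds(5))
val base = ssc.socketTextStream("localhost", 9999)
val w1 = base.window(Seconds(30), Seconds(10))   // multiples of the 5s batch interval
val w2 = w1.window(Seconds(60), Seconds(20))     // multiples of w1's 10s slide interval
```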

Creating a UDF/UDAF using code generation

2016-09-04 Thread AssafMendelson
Hi, I want to write a UDF/UDAF which provides native processing performance. Currently, when creating a UDF/UDAF in the normal manner, performance takes a hit because it breaks optimizations. I tried something like this: import org.apache.spark.sql.catalyst.InternalRow import

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I don't have anything off hand (unfortunately I didn't really save it) but you can easily make some toy examples. For example you might do something like defining a simple UDF (e.g. test if a number < 10), then create the function in scala: package com.example import
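A sketch of that toy example, using the standard Spark SQL UDF registration API (the package and object names are illustrative):

```scala
package com.example

import org.apache.spark.sql.SQLContext

// A trivial Scala UDF (x < 10) that can be registered and then timed
// against the equivalent Python UDF for comparison.
object ToyUdf {
  def register(sqlContext: SQLContext): Unit =
    sqlContext.udf.register("isSmall", (x: Int) => x < 10)
}
```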

Re: Scala Vs Python

2016-09-04 Thread Simon Edelhaus
Any thoughts about Spark and Erlang? -- ttfn Simon Edelhaus California 2016 On Sun, Sep 4, 2016 at 1:00 AM, ayan guha wrote: > Hi > > This one is quite interesting. Is it possible to share few toy examples? > > On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson

Re: Scala Vs Python

2016-09-04 Thread ayan guha
Hi This one is quite interesting. Is it possible to share few toy examples? On Sun, Sep 4, 2016 at 5:23 PM, AssafMendelson wrote: > I am not aware of any official testing but you can easily create your own. > > In testing I made I saw that python UDF were more than 10

RE: Scala Vs Python

2016-09-04 Thread AssafMendelson
I am not aware of any official testing but you can easily create your own. In testing I did, I saw that Python UDFs were more than 10 times slower than Scala UDFs (and in some cases closer to 50 times slower). That said, it would depend on how you use your UDF. For example, let's say you have

Re: Creating RDD using swebhdfs with truststore

2016-09-04 Thread Denis Bolshakov
Hello, I would also set java opts for the driver. Best regards, Denis On 4 Sep 2016 at 00:31, "Sourav Mazumder" <sourav.mazumde...@gmail.com> wrote: > Hi, > > I am trying to create a RDD by using swebhdfs to a remote hadoop cluster > which is protected by Knox and uses SSL. > > The code
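A hedged sketch of what setting java opts for the driver (and executors) might look like here; the truststore path, password, and application jar are placeholders:

```
spark-submit \
  --driver-java-options "-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit" \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit" \
  your-app.jar
```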