Re: combineByKey

2019-04-05 Thread Madabhattula Rajesh Kumar
Hi,

Thank you for the details. That was a typo made while composing the mail.
Below is the actual flow.

Any idea why combineByKey is not working? aggregateByKey is working.

//Defining createCombiner, mergeValue and mergeCombiner functions

def createCombiner = (Id: String, value: String) => (value :: Nil).toSet

def mergeValue = (accumulator1: Set[String], accumulator2: (String,
String)) => accumulator1 ++ Set(accumulator2._2)

def mergeCombiner: (Set[String], Set[String]) => Set[String] =
(accumulator1: Set[String], accumulator2: Set[String]) => accumulator1 ++
accumulator2

sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.Id, (x.Id,
x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)

*Compile Error:-*
 found   : (String, String) => scala.collection.immutable.Set[String]
 required: ((String, String)) => ?
 sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.Id, (x.Id,
x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)

*aggregateByKey =>*

val result = sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.Id,
(x.Id, x.value))).aggregateByKey(Set[String]())(
(aggr, value) => aggr ++ Set(value._2),
(aggr1, aggr2) => aggr1 ++ aggr2).collect().toMap

 print(result)

Map(0-d1 -> Set(t1, t2, t3, t4), 0-d2 -> Set(t1, t5, t6, t2), 0-d3 ->
Set(t1, t2))
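
For reference, a minimal sketch of combiner functions whose types line up with
what combineByKey expects here (value type V = (String, String), combined type
C = Set[String]). The two-argument createCombiner above is what triggers the
"required: ((String, String)) => ?" error, because createCombiner must take the
value as a single tuple:

def createCombiner: ((String, String)) => Set[String] = v => Set(v._2)

def mergeValue: (Set[String], (String, String)) => Set[String] =
  (acc, v) => acc + v._2

def mergeCombiner: (Set[String], Set[String]) => Set[String] = _ ++ _

sc.parallelize(messages)
  .map(x => (x.timeStamp + "-" + x.Id, (x.Id, x.value)))
  .combineByKey(createCombiner, mergeValue, mergeCombiner)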

Regards,
Rajesh

On Fri, Apr 5, 2019 at 9:58 PM Jason Nerothin 
wrote:

> I broke some of your code down into the following lines:
>
> import spark.implicits._
>
> val a: RDD[Messages]= sc.parallelize(messages)
> val b: Dataset[Messages] = a.toDF.as[Messages]
> val c: Dataset[(String, (String, String))] = b.map{x => (x.timeStamp +
> "-" + x.Id, (x.Id, x.value))}
>
> You didn't capitalize .Id and your mergeValue0 and mergeCombiner don't
> have the types you think for the reduceByKey.
>
> I recommend breaking the code down like this to statement-by-statement
> when you get into a dance with the Scala type system.
>
> The type-safety that you're after (that eventually makes life *easier*) is
> best supported by Dataset (would have prevented the .id vs .Id error).
> Although there are some performance tradeoffs vs RDD and DataFrame...
>
>
>
>
>
>
> On Fri, Apr 5, 2019 at 2:11 AM Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> Any issue in the below code.
>>
>> case class Messages(timeStamp: Int, Id: String, value: String)
>>
>> val messages = Array(
>>   Messages(0, "d1", "t1"),
>>   Messages(0, "d1", "t1"),
>>   Messages(0, "d1", "t1"),
>>   Messages(0, "d1", "t1"),
>>   Messages(0, "d1", "t2"),
>>   Messages(0, "d1", "t2"),
>>   Messages(0, "d1", "t3"),
>>   Messages(0, "d1", "t4"),
>>   Messages(0, "d2", "t1"),
>>   Messages(0, "d2", "t1"),
>>   Messages(0, "d2", "t5"),
>>   Messages(0, "d2", "t6"),
>>   Messages(0, "d2", "t2"),
>>   Messages(0, "d2", "t2"),
>>   Messages(0, "d3", "t1"),
>>   Messages(0, "d3", "t1"),
>>   Messages(0, "d3", "t2")
>> )
>>
>> //Defining createCombiner, mergeValue and mergeCombiner functions
>> def createCombiner = (id: String, value: String) => Set(value)
>>
>> def mergeValue0 = (accumulator1: Set[String], accumulator2: (String,
>> String)) => accumulator1 ++ Set(accumulator2._2)
>>
>> def mergeCombiner: (Set[String], Set[String]) => Set[String] =
>> (accumulator1: Set[String], accumulator2: Set[String]) => accumulator1 ++
>> accumulator2
>>
>> sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id,
>> x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)
>>
>> *Compile Error:-*
>>  found   : (String, String) => scala.collection.immutable.Set[String]
>>  required: ((String, String)) => ?
>> sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id,
>> x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)
>>
>> Regards,
>> Rajesh
>>
>>
>
> --
> Thanks,
> Jason
>


combineByKey

2019-04-05 Thread Madabhattula Rajesh Kumar
Hi,

Is there any issue in the below code?

case class Messages(timeStamp: Int, Id: String, value: String)

val messages = Array(
  Messages(0, "d1", "t1"),
  Messages(0, "d1", "t1"),
  Messages(0, "d1", "t1"),
  Messages(0, "d1", "t1"),
  Messages(0, "d1", "t2"),
  Messages(0, "d1", "t2"),
  Messages(0, "d1", "t3"),
  Messages(0, "d1", "t4"),
  Messages(0, "d2", "t1"),
  Messages(0, "d2", "t1"),
  Messages(0, "d2", "t5"),
  Messages(0, "d2", "t6"),
  Messages(0, "d2", "t2"),
  Messages(0, "d2", "t2"),
  Messages(0, "d3", "t1"),
  Messages(0, "d3", "t1"),
  Messages(0, "d3", "t2")
)

//Defining createCombiner, mergeValue and mergeCombiner functions
def createCombiner = (id: String, value: String) => Set(value)

def mergeValue0 = (accumulator1: Set[String], accumulator2: (String,
String)) => accumulator1 ++ Set(accumulator2._2)

def mergeCombiner: (Set[String], Set[String]) => Set[String] =
(accumulator1: Set[String], accumulator2: Set[String]) => accumulator1 ++
accumulator2

sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id,
x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)

*Compile Error:-*
 found   : (String, String) => scala.collection.immutable.Set[String]
 required: ((String, String)) => ?
sc.parallelize(messages).map(x => (x.timeStamp+"-"+x.id, (x.id,
x.value))).combineByKey(createCombiner, mergeValue, mergeCombiner)

Regards,
Rajesh


spark jobserver

2017-03-05 Thread Madabhattula Rajesh Kumar
Hi,

I am getting the below exception when I start the job server:

./server_start.sh: line 41: kill: (11482) - No such process

Please let me know how to resolve this error.

Regards,
Rajesh


SimpleConfigObject

2017-03-02 Thread Madabhattula Rajesh Kumar
Hi,

How do I read the JSON string from a SimpleConfigObject?

SimpleConfigObject({"ID":"123","fileName":"123.txt"})
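
A minimal sketch, assuming the object comes from Typesafe Config (the library
behind SimpleConfigObject), is to render the ConfigObject back to JSON:

import com.typesafe.config.{ConfigFactory, ConfigRenderOptions}

val config = ConfigFactory.parseString("""{"ID":"123","fileName":"123.txt"}""")
// render the underlying ConfigObject back to a compact JSON string
val json: String = config.root().render(ConfigRenderOptions.concise())
println(json) // {"ID":"123","fileName":"123.txt"} (key order may vary)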


Regards,
Rajesh


Continuous or Categorical

2017-03-01 Thread Madabhattula Rajesh Kumar
Hi,

Given a set of values (for example, column values in a CSV file), how can I
check whether they are continuous or categorical? Is any statistical test
available?
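
There is no single definitive test, but as a rough illustration, a simple
heuristic sketch (not a formal statistical test) is to treat a column as
categorical when it has few distinct values relative to the row count; the
column name and the 0.05 threshold below are assumptions:

import org.apache.spark.sql.DataFrame

def looksCategorical(df: DataFrame, column: String, threshold: Double = 0.05): Boolean = {
  val total = df.count().toDouble
  val distinct = df.select(column).distinct().count().toDouble
  // few distinct values relative to the number of rows => likely categorical
  distinct / total <= threshold
}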

Regards,
Rajesh


MultiLabelBinarizer

2017-02-08 Thread Madabhattula Rajesh Kumar
Hi,

Do we have an equivalent of the below preprocessing function in Spark ML?

from sklearn.preprocessing import MultiLabelBinarizer
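
Spark ML does not appear to have a direct drop-in equivalent, but a rough
sketch of multi-label binarization using CountVectorizer with binary output
(assuming a column holding arrays of labels) would look like this:

import org.apache.spark.ml.feature.CountVectorizer

val labelled = spark.createDataFrame(Seq(
  (0, Seq("sports", "news")),
  (1, Seq("news")),
  (2, Seq("finance", "sports"))
)).toDF("id", "labels")

val binarizer = new CountVectorizer()
  .setInputCol("labels")
  .setOutputCol("labelVector")
  .setBinary(true) // emit 0/1 indicators rather than term counts

binarizer.fit(labelled).transform(labelled).show(false)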

Regards,

Rajesh


ML version of Kmeans

2017-01-31 Thread Madabhattula Rajesh Kumar
Hi,

I am not able to find a predict method on the "ML" version of KMeans.

The MLlib version has a predict method: KMeansModel.predict(point: Vector).
How do I predict the cluster for new vectors in the ML version of KMeans?
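
A hedged sketch: in the ML API, prediction is typically done with transform()
on a DataFrame that has a features column, rather than a point-wise predict()
call (the column names below are the defaults):

import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors

val train = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(9.0, 9.0))
)).toDF("features")

val model = new KMeans().setK(2).setFeaturesCol("features").fit(train)

// new vectors to assign to clusters
val newPoints = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.5, 0.5))
)).toDF("features")

model.transform(newPoints).select("features", "prediction").show()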

Regards,
Rajesh


PrefixSpan

2017-01-24 Thread Madabhattula Rajesh Kumar
Hi,

Please point me to the internal functionality of PrefixSpan, with examples.
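
For reference, a minimal usage sketch of the MLlib PrefixSpan API (the input is
an RDD of sequences, where each sequence is an array of itemsets):

import org.apache.spark.mllib.fpm.PrefixSpan

val sequences = sc.parallelize(Seq(
  Array(Array(1, 2), Array(3)),
  Array(Array(1), Array(3, 2), Array(1, 2)),
  Array(Array(1, 2), Array(5)),
  Array(Array(6))
), 2)

val prefixSpan = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)

val model = prefixSpan.run(sequences)
model.freqSequences.collect().foreach { fs =>
  println(fs.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + fs.freq)
}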

Regards,
Rajesh


Need help :- org.apache.spark.SparkException :- No such file or directory

2016-09-29 Thread Madabhattula Rajesh Kumar
Hi Team,

I am getting the below exception in Spark jobs. Please let me know how to fix
this issue.

*Below is my cluster configuration :- *

I am using SparkJobServer to trigger the jobs. Below is my configuration in
SparkJobServer.

   - num-cpu-cores = 4
   - memory-per-node = 4G

I have 4 workers in my cluster.


"result": {
"errorClass": "org.apache.spark.SparkException",
"cause":
"/tmp/spark-31a538f3-9451-4a2d-9123-00feb56c9e91/executor-73be6ffd-cd03-452a-bc99-a44290953d4f/spark-d0630f1f-e3df-4714-af30-4c839f6e3e8a/9400069401471754061530_lock
(No such file or directory)",
"stack": ["java.io.RandomAccessFile.open0(Native Method)",
"java.io.RandomAccessFile.open(RandomAccessFile.java:316)",
"java.io.RandomAccessFile.(RandomAccessFile.java:243)",
"org.apache.spark.util.Utils$.fetchFile(Utils.scala:373)",
"org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)",
"org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)",
"scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)",
"scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)",
"scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)",
"scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)",
"scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)",
"scala.collection.mutable.HashMap.foreach(HashMap.scala:98)",
"scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)",
"org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)",
"org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)",
"java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)",
"java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)",
"java.lang.Thread.run(Thread.java:745)"],
"causingClass": "java.io.FileNotFoundException",
"message": "Job aborted due to stage failure: Task 0 in stage 1286.0
failed 4 times, most recent failure: Lost task 0.3 in stage 1286.0 (TID
39149, svcjo-prd911.cisco.com): java.io.FileNotFoundException:
/tmp/spark-31a538f3-9451-4a2d-9123-00feb56c9e91/executor-73be6ffd-cd03-452a-bc99-a44290953d4f/spark-d0630f1f-e3df-4714-af30-4c839f6e3e8a/9400069401471754061530_lock
(No such file or directory)\n\tat java.io.RandomAccessFile.open0(Native
Method)\n\tat
java.io.RandomAccessFile.open(RandomAccessFile.java:316)\n\tat
java.io.RandomAccessFile.(RandomAccessFile.java:243)\n\tat
org.apache.spark.util.Utils$.fetchFile(Utils.scala:373)\n\tat
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)\n\tat
org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)\n\tat
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)\n\tat
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)\n\tat
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)\n\tat
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)\n\tat
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)\n\tat
scala.collection.mutable.HashMap.foreach(HashMap.scala:98)\n\tat
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)\n\tat
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)\n\tat
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)\n\tat
java.lang.Thread.run(Thread.java:745)\n\nDriver stacktrace:"
  },

Regards,
Rajesh


Re: LabeledPoint creation

2016-09-08 Thread Madabhattula Rajesh Kumar
Hi,

I have done this in a different way. Please correct me: is this approach
right?

val df = spark.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

   val categories: List[String] = List("a", "b", "c", "d")
val categoriesList: Array[Double] = new Array[Double](categories.size)
val labelPoint = df.rdd.map { line =>
  val values = line.getAs("category").toString()
  val id = line.getAs[java.lang.Integer]("id").toDouble
  var i = -1
  categories.foreach { x => i += 1; categoriesList(i) = if (x ==
values) 1.0 else 0.0 }
  val denseVector = Vectors.dense(categoriesList)
  LabeledPoint(id, denseVector)
}
labelPoint.foreach { x => println(x) }

*Output :-*

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])
Regards,
Rajesh


On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.f...@gmail.com> wrote:

> It has 4 categories
> a = 1 0 0
> b = 0 0 0
> c = 0 1 0
> d = 0 0 1
>
> --
> Oleksiy Dyagilev
>
> On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> Any help on above mail use case ?
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am new to Spark ML, trying to create a LabeledPoint from categorical
>>> dataset(example code from spark). For this, I am using One-hot encoding
>>> <http://en.wikipedia.org/wiki/One-hot> feature. Below is my code
>>>
>>> val df = sparkSession.createDataFrame(Seq(
>>>   (0, "a"),
>>>   (1, "b"),
>>>   (2, "c"),
>>>   (3, "a"),
>>>   (4, "a"),
>>>   (5, "c"),
>>>   (6, "d"))).toDF("id", "category")
>>>
>>> val indexer = new StringIndexer()
>>>   .setInputCol("category")
>>>   .setOutputCol("categoryIndex")
>>>   .fit(df)
>>>
>>> val indexed = indexer.transform(df)
>>>
>>> indexed.select("category", "categoryIndex").show()
>>>
>>> val encoder = new OneHotEncoder()
>>>   .setInputCol("categoryIndex")
>>>   .setOutputCol("categoryVec")
>>> val encoded = encoder.transform(indexed)
>>>
>>>  encoded.select("id", "category", "categoryVec").show()
>>>
>>> *Output :- *
>>> +---+--------+-------------+
>>> | id|category|  categoryVec|
>>> +---+--------+-------------+
>>> |  0|       a|(3,[0],[1.0])|
>>> |  1|       b|    (3,[],[])|
>>> |  2|       c|(3,[1],[1.0])|
>>> |  3|       a|(3,[0],[1.0])|
>>> |  4|       a|(3,[0],[1.0])|
>>> |  5|       c|(3,[1],[1.0])|
>>> |  6|       d|(3,[2],[1.0])|
>>> +---+--------+-------------+
>>>
>>> *Creating LablePoint from encoded dataframe:-*
>>>
>>> val data = encoded.rdd.map { x =>
>>>   {
>>> val featureVector = Vectors.dense(x.getAs[org.apac
>>> he.spark.ml.linalg.SparseVector]("categoryVec").toArray)
>>> val label = x.getAs[java.lang.Integer]("id").toDouble
>>> LabeledPoint(label, featureVector)
>>>   }
>>> }
>>>
>>> data.foreach { x => println(x) }
>>>
>>> *Output :-*
>>>
>>> (0.0,[1.0,0.0,0.0])
>>> (1.0,[0.0,0.0,0.0])
>>> (2.0,[0.0,1.0,0.0])
>>> (3.0,[1.0,0.0,0.0])
>>> (4.0,[1.0,0.0,0.0])
>>> (5.0,[0.0,1.0,0.0])
>>> (6.0,[0.0,0.0,1.0])
>>>
>>> I have a four categorical values like a, b, c, d. I am expecting 4
>>> features in the above LablePoint but it has only 3 features.
>>>
>>> Please help me to creation of LablePoint from categorical features.
>>>
>>> Regards,
>>> Rajesh
>>>
>>>
>>>
>>
>


Forecasting algorithms in spark ML

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi,

Please let me know the supported forecasting algorithms in Spark ML.

Regards,
Rajesh


Re: LabeledPoint creation

2016-09-07 Thread Madabhattula Rajesh Kumar
Hi,

Any help on the above use case?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I am new to Spark ML, trying to create a LabeledPoint from categorical
> dataset(example code from spark). For this, I am using One-hot encoding
> <http://en.wikipedia.org/wiki/One-hot> feature. Below is my code
>
> val df = sparkSession.createDataFrame(Seq(
>   (0, "a"),
>   (1, "b"),
>   (2, "c"),
>   (3, "a"),
>   (4, "a"),
>   (5, "c"),
>   (6, "d"))).toDF("id", "category")
>
> val indexer = new StringIndexer()
>   .setInputCol("category")
>   .setOutputCol("categoryIndex")
>   .fit(df)
>
> val indexed = indexer.transform(df)
>
> indexed.select("category", "categoryIndex").show()
>
> val encoder = new OneHotEncoder()
>   .setInputCol("categoryIndex")
>   .setOutputCol("categoryVec")
> val encoded = encoder.transform(indexed)
>
>  encoded.select("id", "category", "categoryVec").show()
>
> *Output :- *
> +---+--------+-------------+
> | id|category|  categoryVec|
> +---+--------+-------------+
> |  0|       a|(3,[0],[1.0])|
> |  1|       b|    (3,[],[])|
> |  2|       c|(3,[1],[1.0])|
> |  3|       a|(3,[0],[1.0])|
> |  4|       a|(3,[0],[1.0])|
> |  5|       c|(3,[1],[1.0])|
> |  6|       d|(3,[2],[1.0])|
> +---+--------+-------------+
>
> *Creating LablePoint from encoded dataframe:-*
>
> val data = encoded.rdd.map { x =>
>   {
> val featureVector = Vectors.dense(x.getAs[org.
> apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
> val label = x.getAs[java.lang.Integer]("id").toDouble
> LabeledPoint(label, featureVector)
>   }
> }
>
> data.foreach { x => println(x) }
>
> *Output :-*
>
> (0.0,[1.0,0.0,0.0])
> (1.0,[0.0,0.0,0.0])
> (2.0,[0.0,1.0,0.0])
> (3.0,[1.0,0.0,0.0])
> (4.0,[1.0,0.0,0.0])
> (5.0,[0.0,1.0,0.0])
> (6.0,[0.0,0.0,1.0])
>
> I have a four categorical values like a, b, c, d. I am expecting 4
> features in the above LablePoint but it has only 3 features.
>
> Please help me to creation of LablePoint from categorical features.
>
> Regards,
> Rajesh
>
>
>


LabeledPoint creation

2016-09-06 Thread Madabhattula Rajesh Kumar
Hi,

I am new to Spark ML and am trying to create a LabeledPoint from a categorical
dataset (example code from Spark). For this, I am using the one-hot encoding
feature. Below is my code:

val df = sparkSession.createDataFrame(Seq(
  (0, "a"),
  (1, "b"),
  (2, "c"),
  (3, "a"),
  (4, "a"),
  (5, "c"),
  (6, "d"))).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df)

val indexed = indexer.transform(df)

indexed.select("category", "categoryIndex").show()

val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)

 encoded.select("id", "category", "categoryVec").show()

*Output :- *
+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+

*Creating LablePoint from encoded dataframe:-*

val data = encoded.rdd.map { x =>
  {
val featureVector =
Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
val label = x.getAs[java.lang.Integer]("id").toDouble
LabeledPoint(label, featureVector)
  }
}

data.foreach { x => println(x) }

*Output :-*

(0.0,[1.0,0.0,0.0])
(1.0,[0.0,0.0,0.0])
(2.0,[0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0])
(4.0,[1.0,0.0,0.0])
(5.0,[0.0,1.0,0.0])
(6.0,[0.0,0.0,1.0])

I have four categorical values (a, b, c, d). I am expecting 4 features in the
above LabeledPoint, but it has only 3 features.

Please help me with the creation of a LabeledPoint from categorical features.
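
A hedged sketch of one likely cause: OneHotEncoder drops the last category by
default, which is why 4 categories produce only 3 feature slots. Keeping every
category is a matter of setDropLast(false), reusing the "indexed" DataFrame
from the code above:

import org.apache.spark.ml.feature.OneHotEncoder

val encoderAllCats = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryVec")
  .setDropLast(false) // keep a slot for every category => 4-dimensional vectors

val encodedAllCats = encoderAllCats.transform(indexed)
encodedAllCats.select("id", "category", "categoryVec").show()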

Regards,
Rajesh


Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich,

Thank you.

Regards,
Rajesh

On Mon, Aug 15, 2016 at 6:35 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Ok Rajesh
>
> This is standalone.
>
> In that case it ought to be at least 4 connections as one executor will
> use one worker.
>
> I am hesitant in here as you can see with (at least) as with Standalone
> mode you may end up with more executors on each worker.
>
> But try it and see whether numPartitions" -> "4" is good or you can
> change this to something higher.
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 15 August 2016 at 12:19, Madabhattula Rajesh Kumar <mrajaf...@gmail.com
> > wrote:
>
>> Hi Mich,
>>
>> Thank you for detailed explanation. One more question
>>
>> In my cluster, I have one master and 4 workers. In this case, 4
>> connections will be opened to Oracle ?
>>
>> Regards,
>> Rajesh
>>
>> On Mon, Aug 15, 2016 at 3:59 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> It happens that the number of parallel processes open from Spark to
>>> RDBMS is determined by the number of executors.
>>>
>>> I just tested this.
>>>
>>> With Yarn client using to executors I see two connections to RDBMS
>>>
>>>
>>> EXECUTIONS USERNAME   SID SERIAL# USERS_EXECUTING SQL_TEXT
>>> -- -- --- --- ---
>>> --
>>>  1 SCRATCHPAD 443   62565   1 SELECT
>>> "RANDOMISED","RANDOM_STRING","PADDING","CLU
>>>
>>> STERED","ID","SCATTERED","SMALL_VC" FROM (SELECT t
>>>   o_char(ID) AS ID,
>>> to_char(CLUSTERED) AS CLUSTERED,
>>>
>>> to_char(SCATTERED) AS SCATTERED, to_char(RANDOMIS
>>>   ED) AS
>>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDIN
>>>   G FROM
>>> scratchpad.dummy) WHERE ID >= 2301 AND
>>>   ID < 2401
>>>  1 SCRATCHPAD 406   46793   1 SELECT
>>> "RANDOMISED","RANDOM_STRING","PADDING","CLU
>>>
>>> STERED","ID","SCATTERED","SMALL_VC" FROM (SELECT t
>>>   o_char(ID) AS ID,
>>> to_char(CLUSTERED) AS CLUSTERED,
>>>
>>> to_char(SCATTERED) AS SCATTERED, to_char(RANDOMIS
>>>   ED) AS
>>> RANDOMISED, RANDOM_STRING, SMALL_VC, PADDIN
>>>   G FROM
>>> scratchpad.dummy) WHERE ID >= 2401 AND
>>>   ID < 2501
>>>
>>> So it  sounds like (can someone else independently confirm this) that
>>> regardless of what one specifies in "numPartitions" one ends up one
>>> connection from one Spark executor to RDBMS.
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary dam

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
  | "numPartitions" -> "100",
>>  | "user" -> _username,
>>  | "password" -> _password)).load
>> s: org.apache.spark.sql.DataFrame = [ID: string, CLUSTERED: string ... 5
>> more fields]
>> scala> s.toJavaRDD.partitions.size()
>> res1: Int = 100
>>
>> This also seems to set the number of partitions. I still think that the
>> emphasis has to be on getting data from RDBMS as quickly as possible. The
>> partitioning does work. In below the login scratchpad has multiple
>> connections to Oracle and does the range selection OK
>>
>>  1 SCRATCHPAD  45   43048   1 SELECT
>> "SMALL_VC","CLUSTERED","PADDING","RANDOM_ST
>>
>> RING","ID","SCATTERED","RANDOMISED" FROM (SELECT t
>>   o_char(ID) AS ID,
>> to_char(CLUSTERED) AS CLUSTERED,
>>
>> to_char(SCATTERED) AS SCATTERED, to_char(RANDOMIS
>>   ED) AS RANDOMISED,
>> RANDOM_STRING, SMALL_VC, PADDIN
>>   G FROM
>> scratchpad.dummy)
>> *WHERE ID >= 1601
>> AND  ID < 1701*
>>
>> HTH
>>
>>
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 15 August 2016 at 08:18, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I would suggest you to look at sqoop as well. Essentially, you can
>>> provide a splitBy/partitionBy column using which data will be distributed
>>> among your stated number of mappers
>>>
>>> On Mon, Aug 15, 2016 at 5:07 PM, Madabhattula Rajesh Kumar <
>>> mrajaf...@gmail.com> wrote:
>>>
>>>> Hi Mich,
>>>>
>>>> I have a below question.
>>>>
>>>> I want to join two tables and return the result based on the input
>>>> value. In this case, how we need to specify lower bound and upper bound
>>>> values ?
>>>>
>>>> select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where
>>>> t1.transactionid=*1* and t1.id = t2.id
>>>>
>>>> *1 => dynamic input value.*
>>>>
>>>> Regards,
>>>> Rajesh
>>>>
>>>> On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> If you have your RDBMS table partitioned, then you need to consider
>>>>> how much data you want to extract in other words the result set returned 
>>>>> by
>>>>> the JDBC call.
>>>>>
>>>>> If you want all the data, then the number of partitions specified in
>>>>> the JDBC call should be equal to the number of partitions in your RDBMS
>>>>> table.
>>>>>
>>>>> HTH
>>>>>
>>>>> Dr Mich Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> LinkedIn * 
>>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>>
>>>>>
>>>>>
>>>>> http://talebzadehmich.wordpress.com
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, 

Re: parallel processing with JDBC

2016-08-15 Thread Madabhattula Rajesh Kumar
Hi Mich,

I have a question below.

I want to join two tables and return the result based on an input value.
In this case, how do we need to specify the lower bound and upper bound values?

select t1.id, t1.name, t2.course, t2.qualification from t1, t2 where
t1.transactionid=*1* and t1.id = t2.id

*1 => dynamic input value.*
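
For illustration, a hedged sketch of how such a join might be pushed down as a
subquery in "dbtable" (the URL, bounds and partition count are assumptions; the
bounds would come from the min/max of t1.id for the filtered result):

val transactionId = 1 // the dynamic input value

val joined = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:oracle:thin:@//host:1521/SID",
  "dbtable" -> s"""(select t1.id, t1.name, t2.course, t2.qualification
                      from t1, t2
                     where t1.transactionid = $transactionId
                       and t1.id = t2.id)""",
  "partitionColumn" -> "ID",
  "lowerBound" -> "1",
  "upperBound" -> "1000000",
  "numPartitions" -> "4",
  "user" -> "username",
  "password" -> "password")).load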

Regards,
Rajesh

On Mon, Aug 15, 2016 at 12:05 PM, Mich Talebzadeh  wrote:

> If you have your RDBMS table partitioned, then you need to consider how
> much data you want to extract in other words the result set returned by the
> JDBC call.
>
> If you want all the data, then the number of partitions specified in the
> JDBC call should be equal to the number of partitions in your RDBMS table.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 14 August 2016 at 21:44, Ashok Kumar  wrote:
>
>> Thank you very much sir.
>>
>> I forgot to mention that two of these Oracle tables are range
>> partitioned. In that case what would be the optimum number of partitions if
>> you can share?
>>
>> Warmest
>>
>>
>> On Sunday, 14 August 2016, 21:37, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> If you have primary keys on these tables then you can parallelise the
>> process reading data.
>>
>> You have to be careful not to set the number of partitions too many.
>> Certainly there is a balance between the number of partitions supplied to
>> JDBC and the load on the network and the source DB.
>>
>> Assuming that your underlying table has primary key ID, then this will
>> create 20 parallel processes to Oracle DB
>>
>>  val d = HiveContext.read.format("jdbc").options(
>>  Map("url" -> _ORACLEserver,
>>  "dbtable" -> "(SELECT , , FROM )",
>>  "partitionColumn" -> "ID",
>>  "lowerBound" -> "1",
>>  "upperBound" -> "maxID",
>>  "numPartitions" -> "20",
>>  "user" -> _username,
>>  "password" -> _password)).load
>>
>> assuming your upper bound on ID is maxID
>>
>>
>> This will open multiple connections to RDBMS, each getting a subset of
>> data that you want.
>>
>> You need to test it to ensure that you get the numPartitions optimum and
>> you don't overload any component.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>> On 14 August 2016 at 21:15, Ashok Kumar 
>> wrote:
>>
>> Hi,
>>
>> There are 4 tables ranging from 10 million to 100 million rows but they
>> all have primary keys.
>>
>> The network is fine but our Oracle is RAC and we can only connect to a
>> designated Oracle node (where we have a DQ account only).
>>
>> We have a limited time window of few hours to get the required data out.
>>
>> Thanks
>>
>>
>> On Sunday, 14 August 2016, 21:07, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>> How big are your tables and is there any issue with the network between
>> your Spark nodes and your Oracle DB that adds to issues?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn * https://www.linkedin.com/ profile/view?id=
>> AAEWh2gBxianrbJd6zP6AcPCCd OABUrV8Pw
>> *
>>
>> http://talebzadehmich. wordpress.com
>> 
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>> On 14 August 2016 at 20:50, Ashok Kumar 
>> wrote:
>>
>> Hi Gurus,
>>
>> I have few large tables in rdbms (ours is Oracle). We want to access
>> these tables through Spark JDBC
>>
>> What is the quickest way of getting data into Spark Dataframe say
>> 

SparkSQL parallelism

2016-02-11 Thread Madabhattula Rajesh Kumar
Hi,

I have a Spark cluster with one master and 3 worker nodes. I have written the
below code to fetch records from Oracle using Spark SQL:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val employees = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:oracle:thin:@:1525:SID",
"dbtable" -> "(select * from employee where name like '%18%')",
"user" -> "username",
"password" -> "password")).load

I have submitted this job to the Spark cluster using the spark-submit command.

*It looks like all 3 workers are executing the same query and fetching the same
data; i.e., it is making 3 JDBC calls to Oracle.*
*How can I make this code issue a single JDBC call to Oracle (in the case of
more than one worker)?*

Please help me to resolve this use case

Regards,
Rajesh


spark-cassandra

2016-02-03 Thread Madabhattula Rajesh Kumar
Hi,

I am using Spark Jobserver to submit the jobs and the spark-cassandra
connector to connect to Cassandra. I am getting the below exception through
the Spark Jobserver.

If I submit the job through the *spark-submit* command, it works fine.

Please let me know how to solve this issue.


Exception in thread "pool-1-thread-1" java.lang.NoSuchMethodError:
com.datastax.driver.core.TableMetadata.getIndexes()Ljava/util/List;
at
com.datastax.spark.connector.cql.Schema$.getIndexMap(Schema.scala:193)
at
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchPartitionKey(Schema.scala:197)
at
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:239)
at
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchTables$1$2.apply(Schema.scala:238)
at
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchTables$1(Schema.scala:238)
at
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:247)
at
com.datastax.spark.connector.cql.Schema$$anonfun$com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1$2.apply(Schema.scala:246)
at
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at
scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
at
scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
at
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at
com.datastax.spark.connector.cql.Schema$.com$datastax$spark$connector$cql$Schema$$fetchKeyspaces$1(Schema.scala:246)
at
com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:252)
at
com.datastax.spark.connector.cql.Schema$$anonfun$fromCassandra$1.apply(Schema.scala:249)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:121)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withClusterDo$1.apply(CassandraConnector.scala:120)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:110)
at
com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:109)
at
com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:139)
at
com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:109)
at
com.datastax.spark.connector.cql.CassandraConnector.withClusterDo(CassandraConnector.scala:120)
at
com.datastax.spark.connector.cql.Schema$.fromCassandra(Schema.scala:249)
at
com.datastax.spark.connector.writer.TableWriter$.apply(TableWriter.scala:263)
at
com.datastax.spark.connector.RDDFunctions.saveToCassandra(RDDFunctions.scala:36)
at
com.cisco.ss.etl.utils.ETLHelper$class.persistBackupConfigDevicesData(ETLHelper.scala:79)
at com.cisco.ss.etl.Main$.persistBackupConfigDevicesData(Main.scala:13)
at
com.cisco.ss.etl.utils.ETLHelper$class.persistByBacthes(ETLHelper.scala:43)
at com.cisco.ss.etl.Main$.persistByBacthes(Main.scala:13)
at com.cisco.ss.etl.Main$$anonfun$runJob$3.apply(Main.scala:48)
at com.cisco.ss.etl.Main$$anonfun$runJob$3.apply(Main.scala:45)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at com.cisco.ss.etl.Main$.runJob(Main.scala:45)
at com.cisco.ss.etl.Main$.runJob(Main.scala:13)
at
spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:274)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Regards,
Rajesh


SQL

2016-01-26 Thread Madabhattula Rajesh Kumar
Hi,

To read data from Oracle, I am using sqlContext. Below is the call I am making.

Do the lowerBound and upperBound values have to be the actual minimum and
maximum values of the column in the table, or can we give any numbers?

Please clarify.

sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbcURL",
"dbtable" -> query,
"driver" -> "oracle.jdbc.driver.OracleDriver",
"user" -> "jdbcUsername",
"password" ->"jdbcPassword",
"numPartitions" -> "10",
"lowerBound" -> "1",
"upperBound" -> "1000",
"partitionColumn" -> "col1" )).load
  }
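
For what it's worth, a hedged sketch of a common pattern (the table name
mytable below is a placeholder): the bounds only control how the col1 range is
split into partition strides; rows outside the range are not filtered out, they
just all land in the first or last partition. One option is therefore to look
up the real min/max first and pass those in:

val boundsRow = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbcURL",
    "dbtable" -> "(select min(col1) as lo, max(col1) as hi from mytable)",
    "driver" -> "oracle.jdbc.driver.OracleDriver",
    "user" -> "jdbcUsername",
    "password" -> "jdbcPassword")).load.first()

val lowerBound = boundsRow.get(0).toString
val upperBound = boundsRow.get(1).toString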

Regards,
Rajesh


Re: Clarification on Data Frames joins

2016-01-24 Thread Madabhattula Rajesh Kumar
Hi,

Any suggestions on this approach?

Regards,
Rajesh

On Sat, Jan 23, 2016 at 11:24 PM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I have a big database table(1 million plus records) in oracle. I need to
> query records based on input numbers. For this use case, I am doing below
> steps
>
> I am creating two data frames.
>
> DF1 = I am computing this DF1 using sql query. It has one million +
> records.
>
> DF2 = I have a list of numbers. I am converting list of input numbers to
> data-frame
>
> I am converting DF1 and DF2 to register temp table and forming sql query.
> It will return input number of records
>
> Steps :-
>
> DF1.registerTempTable("E1")
>
> DF2.registerTempTable("E2")
>
> DF3 = sqlContext.sql(select * from E1, E2 where E1.id = E2.id)
>
> DF3.map(row => (row(0),row(1),row(2))).saveToCassandra(keyspace, table1)
>
> *query :-*
>
> How D3 records will fetch?
>
> Is DF1 load entire table data(1 million plus records) into memory when
> joining with DF2 ? (Or) It will fetch only DF2 matched records from oracle
> and load into memory.
>
> Please clarify and let me know my approach is correct
>
> Regards,
> Rajesh
>
>
>
>
>
>
>
>


Clarification on Data Frames joins

2016-01-23 Thread Madabhattula Rajesh Kumar
Hi,

I have a big database table (1 million plus records) in Oracle. I need to
query records based on input numbers. For this use case, I am doing the below
steps.

I am creating two data frames.

DF1 = I am computing this DF1 using sql query. It has one million +
records.

DF2 = I have a list of numbers. I am converting the list of input numbers to a
DataFrame.

I am registering DF1 and DF2 as temp tables and forming a SQL query. It will
return the records matching the input numbers.

Steps :-

DF1.registerTempTable("E1")

DF2.registerTempTable("E2")

DF3 = sqlContext.sql(select * from E1, E2 where E1.id = E2.id)

DF3.map(row => (row(0),row(1),row(2))).saveToCassandra(keyspace, table1)

*Query :-*

How will the DF3 records be fetched?

Will DF1 load the entire table data (1 million plus records) into memory when
joining with DF2, or will it fetch only the DF2-matched records from Oracle
and load those into memory?

Please clarify and let me know if my approach is correct.
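
As a hedged aside, the physical plan is a quick way to check what actually gets
pushed to Oracle; a join key against another DataFrame is generally not pushed
down as a filter, so the JDBC scan of E1 covers the whole table (read and
shuffled by Spark, though not necessarily all held in memory at once unless it
is cached):

val df3 = sqlContext.sql("select * from E1, E2 where E1.id = E2.id")
df3.explain(true) // inspect the JDBCRelation and any PushedFilters in the plan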

Regards,
Rajesh


Re: Concurrent Spark jobs

2016-01-19 Thread Madabhattula Rajesh Kumar
Hi,

Just a thought: can we use Spark Job Server and trigger the jobs through REST
APIs? In that case, all jobs will share the same context and run in parallel.

If anyone has other thoughts, please share.

Regards,
Rajesh

On Tue, Jan 19, 2016 at 10:28 PM, emlyn  wrote:

> We have a Spark application that runs a number of ETL jobs, writing the
> outputs to Redshift (using databricks/spark-redshift). This is triggered by
> calling DataFrame.write.save on the different DataFrames one after another.
> I noticed that during the Redshift load while the output of one job is
> being
> loaded into Redshift (which can take ~20 minutes for some jobs), the
> cluster
> is sitting idle.
>
> In order to maximise the use of the cluster, we tried starting a thread for
> each job so that they can all be submitted simultaneously, and therefore
> the
> cluster can be utilised by another job while one is being written to
> Redshift.
>
> However, when this is run, it fails with a TimeoutException (see stack
> trace
> below). Would it make sense to increase "spark.sql.broadcastTimeout"? I'm
> not sure that would actually solve anything. Should it not be possible to
> save multiple DataFrames simultaneously? Or any other hints on how to make
> better use of the cluster's resources?
>
> Thanks.
>
>
> Stack trace:
>
> Exception in thread "main" java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Futures timed out after [300
> seconds]
> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
> ...
> at
>
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> ...
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after
> [300 seconds]
> at
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
> at
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at
>
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at
>
> org.apache.spark.sql.execution.joins.BroadcastHashOuterJoin.doExecute(BroadcastHashOuterJoin.scala:113)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
> org.apache.spark.sql.execution.Project.doExecute(basicOperators.scala:46)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at
>
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at
> org.apache.spark.sql.DataFrame.rdd$lzycompute(DataFrame.scala:1676)
> at org.apache.spark.sql.DataFrame.rdd(DataFrame.scala:1673)
> at
> org.apache.spark.sql.DataFrame.mapPartitions(DataFrame.scala:1465)
> at
>
> com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:264)
> at
>
> com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:374)
> at
>
> com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:106)
> at
>
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
> at
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Concurrent-Spark-jobs-tp26011.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark job server

2016-01-16 Thread Madabhattula Rajesh Kumar
Hi,

I am not able to start the Spark Job Server. I am facing the below error;
please let me know how to resolve this issue.

I have configured one master and two workers in cluster mode.

./server_start.sh


*./server_start.sh: line 52: kill: (19621) - No such
process./server_start.sh: line 78:
/home/spark-1.5.2-bin-hadoop2.6/bin/compute-classpath.sh: No such file or
directory*
Regards,
Rajesh


Re: spark job server

2016-01-16 Thread Madabhattula Rajesh Kumar
Hi,

I am using "ooyala/spark-jobserver".

Regards,
Rajesh

On Sat, Jan 16, 2016 at 8:36 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Which distro are you using ?
>
> From the error message, compute-classpath.sh was not found.
> I searched Spark 1.6 built for hadoop 2.6 but didn't find
> either compute-classpath.sh or server_start.sh
>
> Cheers
>
> On Sat, Jan 16, 2016 at 5:33 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I am not able to start spark job sever. I am facing below error. Please
>> let me know, how to resolve this issue.
>>
>> I have configured one master and two workers in cluster mode.
>>
>> ./server_start.sh
>>
>>
>> *./server_start.sh: line 52: kill: (19621) - No such
>> process./server_start.sh: line 78:
>> /home/spark-1.5.2-bin-hadoop2.6/bin/compute-classpath.sh: No such file or
>> directory*
>> Regards,
>> Rajesh
>>
>
>


java.sql.SQLException: Unsupported type -101

2015-12-25 Thread Madabhattula Rajesh Kumar
Hi,

I'm not able to read an Oracle table column of the TIMESTAMP(6) WITH TIME ZONE
datatype using Spark SQL.

I'm getting the below exception. Please let me know how to resolve this issue.

*Exception :-*

Exception in thread "main" java.sql.SQLException: Unsupported type -101
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.org$apache$spark$sql$execution$datasources$jdbc$JDBCRDD$$getCatalystType(JDBCRDD.scala:103)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$1.apply(JDBCRDD.scala:140)
at scala.Option.getOrElse(Option.scala:120)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:139)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
at
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
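
A hedged workaround sketch: cast the unsupported Oracle type to a plain
TIMESTAMP (or VARCHAR2) inside the pushed-down subquery, so the JDBC data
source only sees types it can map (the table and column names below are
placeholders):

val df = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:@host:1525:SID",
    "dbtable" -> "(select id, cast(event_ts as timestamp) as event_ts from my_table)",
    "driver" -> "oracle.jdbc.driver.OracleDriver",
    "user" -> "username",
    "password" -> "password")).load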

Regards,
Rajesh


Re: Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi Ted,

Thank you. Yes. This issue is related to
https://issues.apache.org/jira/browse/SPARK-4170
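
For the record, a hedged sketch of the usual workaround for SPARK-4170: define
an explicit main() instead of extending scala.App, so the fields captured by
the closure are initialized before the executors deserialize it:

object Main {
  def main(args: Array[String]): Unit = {
    val conf = new org.apache.spark.SparkConf().setAppName("accumulator-test")
    val sc = new org.apache.spark.SparkContext(conf)

    val accum = sc.accumulator(0)
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
    println(accum.value) // 10
    sc.stop()
  }
}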

Regards,
Rajesh

On Wed, Dec 23, 2015 at 12:09 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> This should be related:
> https://issues.apache.org/jira/browse/SPARK-4170
>
> On Tue, Dec 22, 2015 at 9:34 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I have a standalone cluster. One Master + One Slave. I'm getting below
>> "NULL POINTER" exception.
>>
>> Could you please help me on this issue.
>>
>>
>> *Code Block :-*
>>  val accum = sc.accumulator(0)
>> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) *==> This
>> line giving exception.*
>>
>> Exception :-
>>
>> 15/12/22 09:18:26 WARN scheduler.TaskSetManager: Lost task 1.0 in stage
>> 0.0 (TID 1, 172.25.111.123): *java.lang.NullPointerException*
>> at com.cc.ss.etl.Main$$anonfun$1.apply$mcVI$sp(Main.scala:25)
>> at com.cc.ss.etl.Main$$anonfun$1.apply(Main.scala:25)
>> at com.cc.ss.etl.Main$$anonfun$1.apply(Main.scala:25)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> 15/12/22 09:18:26 INFO scheduler.TaskSetManager: Lost task 0.0 in stage
>> 0.0 (TID 0) on executor 172.25.111.123: java.lang.NullPointerException
>> (null) [duplicate 1]
>> 15/12/22 09:18:26 INFO scheduler.TaskSetManager: Starting task 0.1 in
>> stage 0.0 (TID 2, 172.25.111.123, PROCESS_LOCAL, 2155 bytes)
>> 15/12/22 09:18:26 INFO scheduler.TaskSetManager: Starting task 1.1 in
>> stage 0.0 (TID 3, 172.25.111.123, PROCESS_LOCAL, 2155 bytes)
>> 15/12/22 09:18:26 WARN scheduler.TaskSetManager: Lost task 1.1 in stage
>> 0.0 (TID 3, 172.25.111.123):
>>
>> Regards,
>> Rajesh
>>
>
>


Stand Alone Cluster - Strange issue

2015-12-22 Thread Madabhattula Rajesh Kumar
Hi,

I have a standalone cluster (one master + one slave). I'm getting the below
NullPointerException.

Could you please help me with this issue?


*Code Block :-*
 val accum = sc.accumulator(0)
sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) *==> This line
giving exception.*

Exception :-

15/12/22 09:18:26 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 0.0
(TID 1, 172.25.111.123): *java.lang.NullPointerException*
at com.cc.ss.etl.Main$$anonfun$1.apply$mcVI$sp(Main.scala:25)
at com.cc.ss.etl.Main$$anonfun$1.apply(Main.scala:25)
at com.cc.ss.etl.Main$$anonfun$1.apply(Main.scala:25)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:890)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1848)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/12/22 09:18:26 INFO scheduler.TaskSetManager: Lost task 0.0 in stage 0.0
(TID 0) on executor 172.25.111.123: java.lang.NullPointerException (null)
[duplicate 1]
15/12/22 09:18:26 INFO scheduler.TaskSetManager: Starting task 0.1 in stage
0.0 (TID 2, 172.25.111.123, PROCESS_LOCAL, 2155 bytes)
15/12/22 09:18:26 INFO scheduler.TaskSetManager: Starting task 1.1 in stage
0.0 (TID 3, 172.25.111.123, PROCESS_LOCAL, 2155 bytes)
15/12/22 09:18:26 WARN scheduler.TaskSetManager: Lost task 1.1 in stage 0.0
(TID 3, 172.25.111.123):

Regards,
Rajesh


Re: spark-submit for dependent jars

2015-12-21 Thread Madabhattula Rajesh Kumar
Hi Jeff and Satish,

I have modified the script and executed it. Please find the command below:

./spark-submit --master local  --class test.Main --jars
/home/user/download/jar/ojdbc7.jar
/home//test/target/spark16-0.0.1-SNAPSHOT.jar

I'm still getting the same exception.


Exception in thread "main" java.sql.SQLException: No suitable driver found
for jdbc:oracle:thin:@:1521:xxx
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
at
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at com.cisco.ss.etl.utils.ETLHelper$class.getData(ETLHelper.scala:22)
at com.cisco.ss.etl.Main$.getData(Main.scala:9)
at com.cisco.ss.etl.Main$delayedInit$body.apply(Main.scala:13)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.cisco.ss.etl.Main$.main(Main.scala:9)
at com.cisco.ss.etl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
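
In case it helps, a hedged sketch of the command shape that usually resolves
this for JDBC drivers: in local/client mode the driver JVM also needs the
driver jar on its classpath, which --jars alone does not always provide, so
adding --driver-class-path as well is the common fix (paths as above):

./spark-submit --master local --class test.Main \
  --driver-class-path /home/user/download/jar/ojdbc7.jar \
  --jars /home/user/download/jar/ojdbc7.jar \
  /home//test/target/spark16-0.0.1-SNAPSHOT.jar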


Regards,
Rajesh

On Mon, Dec 21, 2015 at 7:18 PM, satish chandra j <jsatishchan...@gmail.com>
wrote:

> Hi Rajesh,
> Could you please try giving your cmd as mentioned below:
>
> ./spark-submit --master local  --class  --jars 
> 
>
> Regards,
> Satish Chandra
>
> On Mon, Dec 21, 2015 at 6:45 PM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> How to add dependent jars in spark-submit command. For example: Oracle.
>> Could you please help me to resolve this issue
>>
>> I have a standalone cluster. One Master and One slave.
>>
>> I have used below command it is not working
>>
>> ./spark-submit --master local  --class test.Main
>> /test/target/spark16-0.0.1-SNAPSHOT.jar --jars
>> /home/user/download/jar/ojdbc7.jar
>>
>> *I'm getting below exception :*
>>
>> Exception in thread "main" java.sql.SQLException: No suitable driver
>> found for jdbc:oracle:thin:@:1521:xxx
>> at java.sql.DriverManager.getConnection(DriverManager.java:596)
>> at java.sql.DriverManager.getConnection(DriverManager.java:187)
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
>> at
>> org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
>> at
>> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
>> at com.cisco.ss.etl.utils.ETLHelper$class.getData(ETLHelper.scala:22)
>> at com.cisco.ss.etl.Main$.getData(Main.scala:9)
>> at com.cisco.ss.etl.Main$delayedInit$body.apply(Main.scala:13)
>> at scala.Function0$class.apply

spark-submit for dependent jars

2015-12-21 Thread Madabhattula Rajesh Kumar
Hi,

How do I add dependent jars (for example, the Oracle JDBC driver) to the
spark-submit command? Could you please help me to resolve this issue.

I have a standalone cluster (one master and one slave).

I have used the below command, but it is not working:

./spark-submit --master local  --class test.Main
/test/target/spark16-0.0.1-SNAPSHOT.jar --jars
/home/user/download/jar/ojdbc7.jar

*I'm getting below exception :*

Exception in thread "main" java.sql.SQLException: No suitable driver found
for jdbc:oracle:thin:@:1521:xxx
at java.sql.DriverManager.getConnection(DriverManager.java:596)
at java.sql.DriverManager.getConnection(DriverManager.java:187)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:188)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$getConnector$1.apply(JDBCRDD.scala:181)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:121)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91)
at
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:60)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at com.cisco.ss.etl.utils.ETLHelper$class.getData(ETLHelper.scala:22)
at com.cisco.ss.etl.Main$.getData(Main.scala:9)
at com.cisco.ss.etl.Main$delayedInit$body.apply(Main.scala:13)
at scala.Function0$class.apply$mcV$sp(Function0.scala:40)
at
scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.App$$anonfun$main$1.apply(App.scala:71)
at scala.collection.immutable.List.foreach(List.scala:318)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.App$class.main(App.scala:71)
at com.cisco.ss.etl.Main$.main(Main.scala:9)
at com.cisco.ss.etl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
at
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Regards,
Rajesh
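
One thing worth noting: spark-submit treats everything that comes after the application jar as arguments to the application itself, so the --jars option above is handed to test.Main and never reaches Spark. A minimal sketch of the likely-intended ordering (same paths as above; adding --driver-class-path as well is a common extra step for JDBC drivers, not something stated in this thread):

./spark-submit --master local \
  --jars /home/user/download/jar/ojdbc7.jar \
  --driver-class-path /home/user/download/jar/ojdbc7.jar \
  --class test.Main \
  /test/target/spark16-0.0.1-SNAPSHOT.jar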


Creation of RDD in foreachAsync is failing

2015-12-11 Thread Madabhattula Rajesh Kumar
Hi,

I have a below query. Please help me to solve this

I have 20000 ids. I want to join these ids to a table. This table contains
some blob data, so I cannot join all 20000 ids to this table in one step.

I'm planning to join this table in chunks. For example, in each step I will
join 500 ids. The total number of batches is: 20000/500 = 40

I want to run these 40 batches in parallel. For that, I'm using the
*foreachAsync* method. Now I'm getting the below exception

*An error has occured: Job aborted due to stage failure: Task 0 in stage
1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1,
localhost): java.lang.IllegalStateException: Cannot call methods on a
stopped SparkContext.*

*Code :- *

var listOfEqs = // ListBuffer. Number of elements: 20000
var listOfEqsSeq = listOfEqs.grouped(500).toList
var listR = sc.parallelize(listOfEqsSeq)
var asyncRd =  new AsyncRDDActions(listR)

val f = asyncRd.foreachAsync { x =>
{
val r = sc.parallelize(x).toDF()  // ==> This is the line where I get the
// above-mentioned exception
r.registerTempTable("r")
 val acc = sc.accumulator(0, "My Accumulator")
var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id
")
 result.foreach{ y =>
 {
 acc += y
  }
}
acc.value.foreach(f => // saving values to other db)
}
f.onComplete {
case scala.util.Success(res) => println(res)
case scala.util.Failure(e)=> println("An error has occured: " +
e.getMessage)
}

Please help me to solve this issue

Regards,
Rajesh
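
The exception above comes from calling sc.parallelize inside the foreachAsync closure: that closure runs on the executors, where the SparkContext is not available. A minimal sketch of one alternative (an assumption, not taken from this thread) that keeps every SparkContext/SQLContext call on the driver and still submits the 40 batch jobs concurrently, using plain Scala futures:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// listOfEqsSeq is the List of 500-id groups built above
val futures = listOfEqsSeq.zipWithIndex.map { case (ids, i) =>
  Future {
    // driver-side code: every Spark call happens here, never inside an RDD closure
    val r = sc.parallelize(ids).toDF()
    r.registerTempTable(s"r_$i")                // unique table name per batch
    val joined = sqlContext.sql(s"SELECT r_$i.id, t.data FROM r_$i, t WHERE r_$i.id = t.id")
    joined.collect()                            // or save the rows to the other db here
  }
}
Await.result(Future.sequence(futures), Duration.Inf)

Each Future simply submits an ordinary Spark job from the driver, so the batches run concurrently without any nested use of sc.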


Re: How to use collections inside foreach block

2015-12-10 Thread Madabhattula Rajesh Kumar
Hi Rishi and Ted,

Thank you for the response. Now I'm using accumulators and getting results.
I have another question: how do I run this code in parallel?

Example :-

var listOfIds is a ListBuffer with 20000 records

I'm creating batches. Each batch size is 500, which means the total number of
batches is 40.

listOfIds.grouped(500).foreach { x =>
{
val r = sc.parallelize(x).toDF()
r.registerTempTable("r")
 val acc = sc.accumulator(0, "My Accumulator")
var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id
")
 result.foreach{ y =>
 {
 acc += y
  }
}
acc.value.foreach(f => // saving values to other db)
}

The above code works in sequence. I want to run these 40 batches in
parallel.

*How do I start these 40 batches in parallel instead of in sequence?*

Could you please help me to resolve this use case.

Regards,
Rajesh


On Wed, Dec 9, 2015 at 4:46 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> To add onto what Rishi said, you can use foreachPartition() on result
> where you can save values to DB.
>
> Cheers
>
> On Wed, Dec 9, 2015 at 12:51 AM, Rishi Mishra <rmis...@snappydata.io>
> wrote:
>
>> Your list is defined on the driver, whereas function specified in forEach
>> will be evaluated on each executor.
>> You might want to add an accumulator or handle a Sequence of list from
>> each partition.
>>
>> On Wed, Dec 9, 2015 at 11:19 AM, Madabhattula Rajesh Kumar <
>> mrajaf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a below query. Please help me to solve this
>>>
>>> I have a 2 ids. I want to join these ids to table. This table
>>> contains some blob data. So i can not join these 2000 ids to this table in
>>> one step.
>>>
>>> I'm planning to join this table in a chunks. For example, each step I
>>> will join 5000 ids.
>>>
>>> Below code is not working. I'm not able to add result to ListBuffer.
>>> Result s giving always ZERO
>>>
>>> *Code Block :-*
>>>
>>> var listOfIds is a ListBuffer with 2 records
>>>
>>> listOfIds.grouped(5000).foreach { x =>
>>> {
>>> var v1 = new ListBuffer[String]()
>>> val r = sc.parallelize(x).toDF()
>>> r.registerTempTable("r")
>>> var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id =
>>> t.id")
>>>  result.foreach{ y =>
>>>  {
>>>  v1 += y
>>>   }
>>> }
>>> println(" SIZE OF V1 === "+ v1.size)  ==>
>>>
>>> *THIS VALUE PRINTING AS ZERO*
>>>
>>> *// Save v1 values to other db*
>>> }
>>>
>>> Please help me on this.
>>>
>>> Regards,
>>> Rajesh
>>>
>>
>>
>>
>> --
>> Regards,
>> Rishitesh Mishra,
>> SnappyData . (http://www.snappydata.io/)
>>
>> https://in.linkedin.com/in/rishiteshmishra
>>
>
>
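
Following up on the foreachPartition() suggestion above, a minimal sketch of saving each partition of the joined result straight from the executors (the JDBC URL, table name and column positions below are placeholders, since the target database is never named in the thread):

result.foreachPartition { rows =>
  // runs on the executors: one connection per partition
  val conn = java.sql.DriverManager.getConnection("jdbc:...")            // placeholder URL
  val stmt = conn.prepareStatement("INSERT INTO target_table VALUES (?, ?)")
  rows.foreach { row =>
    stmt.setObject(1, row.get(0).asInstanceOf[AnyRef])   // r.id
    stmt.setObject(2, row.get(1).asInstanceOf[AnyRef])   // t.data
    stmt.executeUpdate()
  }
  conn.close()
}

This avoids funnelling every row through a driver-side collection, which is what the accumulator was standing in for.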


How to use collections inside foreach block

2015-12-08 Thread Madabhattula Rajesh Kumar
Hi,

I have a below query. Please help me to solve this

I have 20000 ids. I want to join these ids to a table. This table contains
some blob data, so I cannot join all 20000 ids to this table in one step.

I'm planning to join this table in chunks. For example, in each step I will
join 5000 ids.

The below code is not working. I'm not able to add results to the ListBuffer.
The result size is always ZERO.

*Code Block :-*

var listOfIds is a ListBuffer with 20000 records

listOfIds.grouped(5000).foreach { x =>
{
var v1 = new ListBuffer[String]()
val r = sc.parallelize(x).toDF()
r.registerTempTable("r")
var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id
")
 result.foreach{ y =>
 {
 v1 += y
  }
}
println(" SIZE OF V1 === "+ v1.size)  ==>

*THIS VALUE PRINTING AS ZERO*

*// Save v1 values to other db*
}

Please help me on this.

Regards,
Rajesh
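
The reason v1 always prints as ZERO is that the closure passed to result.foreach runs on the executors against a serialized copy of the buffer, so the driver-side v1 is never touched. A minimal sketch of one driver-side alternative (an assumption, not the resolution posted in the thread) is to bring each batch back with collect() instead of mutating a shared ListBuffer:

listOfIds.grouped(5000).foreach { x =>
  val r = sc.parallelize(x).toDF()
  r.registerTempTable("r")
  val result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id")
  // collect() runs the job and returns the rows to the driver as a local Array[Row]
  val v1 = result.collect()
  println(" SIZE OF V1 === " + v1.size)
  // save v1 to the other db here, or use result.foreachPartition to write from the executors
}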


Spark SQL IN Clause

2015-12-04 Thread Madabhattula Rajesh Kumar
Hi,

What is the best practice for using an "IN" clause in Spark SQL?

Use Case :- Read the table based on a number column. I have a List of numbers, for
example 1 million of them.

Regards,
Rajesh
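
A very large IN list turns into an equally large SQL string for the parser and optimizer to chew through, so a common alternative (a sketch under the assumption that the ids are Longs and that the target table is already registered as t) is to turn the list of numbers into a table of its own and join:

import sqlContext.implicits._

// ids: Seq[Long] with roughly 1 million entries (assumed)
val idsDF = sc.parallelize(ids).toDF("id")
idsDF.registerTempTable("ids")

// equivalent to "WHERE t.id IN (...)" but expressed as a join
val filtered = sqlContext.sql("SELECT t.* FROM t JOIN ids ON t.id = ids.id")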


Re: How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi JB and Ted,

Thank you very much for the steps

Regards,
Rajesh

On Thu, Dec 3, 2015 at 8:16 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See this thread for Spark 1.6.0 RC1
>
>
> http://search-hadoop.com/m/q3RTtKdUViYHH1b1=+VOTE+Release+Apache+Spark+1+6+0+RC1+
>
> Cheers
>
> On Thu, Dec 3, 2015 at 12:39 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi Team,
>>
>> Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is
>> any jar available, so that I can add that jar as a dependency and test this
>> fix? (Or) Is there any other way I can test this fix against the 1.5.2 code base?
>>
>> Could you please let me know the steps. Thank you for your support
>>
>> Regards,
>> Rajesh
>>
>
>


How to test https://issues.apache.org/jira/browse/SPARK-10648 fix

2015-12-03 Thread Madabhattula Rajesh Kumar
Hi Team,

Looks like this issue is fixed in the 1.6 release. How can I test this fix? Is any
jar available, so that I can add that jar as a dependency and test this fix?
(Or) Is there any other way I can test this fix against the 1.5.2 code base?

Could you please let me know the steps. Thank you for your support

Regards,
Rajesh


Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Prem,

Thank you for the details. I'm not able to build it; I'm facing some issues.

Is there any repository link where I can download the (preview) 1.6
version of the spark-core_2.11 and spark-sql_2.11 jar files?

Regards,
Rajesh

On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure <premsure...@gmail.com> wrote:

> you can refer..:
> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>
>
> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm not able to build Spark 1.6 from source. Could you please share the
>> steps to build Spark 1.6.
>>
>> Regards,
>> Rajesh
>>
>
>


Re: Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi Ted,

I'm not able to find the spark-core_2.11 and spark-sql_2.11 jar files in the above
link.

Regards,
Rajesh

On Tue, Nov 24, 2015 at 11:03 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> See:
>
> http://search-hadoop.com/m/q3RTtF1Zmw12wTWX/spark+1.6+preview=+ANNOUNCE+Spark+1+6+0+Release+Preview
>
> On Tue, Nov 24, 2015 at 9:31 AM, Madabhattula Rajesh Kumar <
> mrajaf...@gmail.com> wrote:
>
>> Hi Prem,
>>
>> Thank you for the details. I'm not able to build. I'm facing some issues.
>>
>> Any repository link, where I can download (preview version of)  1.6
>> version of spark-core_2.11 and spark-sql_2.11 jar files.
>>
>> Regards,
>> Rajesh
>>
>> On Tue, Nov 24, 2015 at 6:03 PM, Prem Sure <premsure...@gmail.com> wrote:
>>
>>> you can refer..:
>>> https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/building-spark.html#building-with-buildmvn
>>>
>>>
>>> On Tue, Nov 24, 2015 at 7:16 AM, Madabhattula Rajesh Kumar <
>>> mrajaf...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm not able to build Spark 1.6 from source. Could you please share the
>>>> steps to build Spark 1.6.
>>>>
>>>> Regards,
>>>> Rajesh
>>>>
>>>
>>>
>>
>


Spark 1.6 Build

2015-11-24 Thread Madabhattula Rajesh Kumar
Hi,

I'm not able to build Spark 1.6 from source. Could you please share the
steps to build Spark 1.6.

Regards,
Rajesh
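
For reference, the building-spark page linked earlier in this digest documents the Maven wrapper bundled with the source; a minimal sketch of the usual invocation from the top of a Spark 1.6 source checkout (the -DskipTests flag is only the common shortcut, not something prescribed in this thread):

build/mvn -DskipTests clean package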


Spark 1.5.3 release

2015-11-19 Thread Madabhattula Rajesh Kumar
Hi,

Please let me know Spark 1.5.3 release date details

Regards,
Rajesh


Re: Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Madabhattula Rajesh Kumar
Hi Richard,

Thank you for the updates. Do you know the tentative timeline for the 1.6 release?
Meanwhile, is there any workaround for this issue?

Regards,
Rajesh



On Thu, Nov 5, 2015 at 10:57 PM, Richard Hillegas <rhil...@us.ibm.com>
wrote:

> Or you may be referring to
> https://issues.apache.org/jira/browse/SPARK-10648. That issue has a
> couple pull requests but I think that the limited bandwidth of the
> committers still applies.
>
> Thanks,
> Rick
>
>
> Richard Hillegas/San Francisco/IBM@IBMUS wrote on 11/05/2015 09:16:42 AM:
>
> > From: Richard Hillegas/San Francisco/IBM@IBMUS
> > To: Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
> > Cc: "user@spark.apache.org" <user@spark.apache.org>,
> > "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
> > Date: 11/05/2015 09:17 AM
> > Subject: Re: Spark sql jdbc fails for Oracle NUMBER type columns
>
> >
> > Hi Rajesh,
> >
> > I think that you may be referring to https://issues.apache.org/jira/
> > browse/SPARK-10909. A pull request on that issue was submitted more
> > than a month ago but it has not been committed. I think that the
> > committers are busy working on issues which were targeted for 1.6
> > and I doubt that they will have the spare cycles to vet that pull
> request.
> >
> > Thanks,
> > Rick
> >
> >
> > Madabhattula Rajesh Kumar <mrajaf...@gmail.com> wrote on 11/05/2015
> > 05:51:29 AM:
> >
> > > From: Madabhattula Rajesh Kumar <mrajaf...@gmail.com>
> > > To: "user@spark.apache.org" <user@spark.apache.org>,
> > > "u...@spark.incubator.apache.org" <u...@spark.incubator.apache.org>
> > > Date: 11/05/2015 05:51 AM
> > > Subject: Spark sql jdbc fails for Oracle NUMBER type columns
> > >
> > > Hi,
> >
> > > Is this issue fixed in 1.5.1 version?
> >
> > > Regards,
> > > Rajesh
>
>


Spark sql jdbc fails for Oracle NUMBER type columns

2015-11-05 Thread Madabhattula Rajesh Kumar
Hi,

Is this issue fixed in 1.5.1 version?

Regards,
Rajesh


Spark Titan

2015-06-21 Thread Madabhattula Rajesh Kumar
Hi,

How do I connect to the Titan database from Spark? Are any out-of-the-box APIs
available?

Regards,
Rajesh


Pyspark saveAsTextFile exceptions

2015-03-13 Thread Madabhattula Rajesh Kumar
Hi Team,

I'm getting the below exception when saving the results into Hadoop.


*Code :*
rdd.saveAsTextFile("hdfs://localhost:9000/home/rajesh/data/result.rdd")

Could you please help me resolve this issue.

15/03/13 17:19:31 INFO spark.SparkContext: Starting job: saveAsTextFile at
NativeMethodAccessorImpl.java:-2
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Got job 6 (saveAsTextFile at
NativeMethodAccessorImpl.java:-2) with 4 output partitions
(allowLocal=false)
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Final stage: Stage
10(saveAsTextFile at NativeMethodAccessorImpl.java:-2)
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Parents of final stage:
List()
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Missing parents: List()
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Submitting Stage 10
(MappedRDD[31] at saveAsTextFile at NativeMethodAccessorImpl.java:-2),
which has no missing parents
15/03/13 17:19:31 INFO storage.MemoryStore: ensureFreeSpace(98240) called
with curMem=203866, maxMem=280248975
15/03/13 17:19:31 INFO storage.MemoryStore: Block broadcast_9 stored as
values in memory (estimated size 95.9 KB, free 267.0 MB)
15/03/13 17:19:31 INFO storage.MemoryStore: ensureFreeSpace(59150) called
with curMem=302106, maxMem=280248975
15/03/13 17:19:31 INFO storage.MemoryStore: Block broadcast_9_piece0 stored
as bytes in memory (estimated size 57.8 KB, free 266.9 MB)
15/03/13 17:19:31 INFO storage.BlockManagerInfo: Added broadcast_9_piece0
in memory on localhost:57655 (size: 57.8 KB, free: 267.2 MB)
15/03/13 17:19:31 INFO storage.BlockManagerMaster: Updated info of block
broadcast_9_piece0
15/03/13 17:19:31 INFO spark.SparkContext: Created broadcast 9 from
broadcast at DAGScheduler.scala:838
15/03/13 17:19:31 INFO scheduler.DAGScheduler: Submitting 4 missing tasks
from Stage 10 (MappedRDD[31] at saveAsTextFile at
NativeMethodAccessorImpl.java:-2)
15/03/13 17:19:31 INFO scheduler.TaskSchedulerImpl: Adding task set 10.0
with 4 tasks
15/03/13 17:19:31 INFO scheduler.TaskSetManager: Starting task 0.0 in stage
10.0 (TID 8, localhost, PROCESS_LOCAL, 1375 bytes)
15/03/13 17:19:31 INFO executor.Executor: Running task 0.0 in stage 10.0
(TID 8)
15/03/13 17:19:31 INFO executor.Executor: Fetching
http://10.0.2.15:54815/files/sftordd_pickle with timestamp 1426247370763
15/03/13 17:19:31 INFO util.Utils: Fetching
http://10.0.2.15:54815/files/sftordd_pickle to
/tmp/fetchFileTemp7846328782039551224.tmp
15/03/13 17:19:31 INFO Configuration.deprecation: mapred.output.dir is
deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/03/13 17:19:31 INFO Configuration.deprecation: mapred.output.key.class
is deprecated. Instead, use mapreduce.job.output.key.class
15/03/13 17:19:31 INFO Configuration.deprecation: mapred.output.value.class
is deprecated. Instead, use mapreduce.job.output.value.class
15/03/13 17:19:31 INFO Configuration.deprecation: mapred.working.dir is
deprecated. Instead, use mapreduce.job.working.dir
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
15/03/13 17:19:31 ERROR python.PythonRDD: Python worker exited unexpectedly
(crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call
last):
  File /home/rajesh/spark-1.2.0/python/pyspark/worker.py, line 90, in main
command = pickleSer._read_with_length(infile)
  File /home/rajesh/spark-1.2.0/python/pyspark/serializers.py, line 145,
in _read_with_length
length = read_int(stream)
  File /home/rajesh/spark-1.2.0/python/pyspark/serializers.py, line 511,
in read_int
raise EOFError
EOFError

at
org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:137)
at
org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:174)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.Exception: Subprocess exited with status 134
at org.apache.spark.rdd.PipedRDD$$anon$1.hasNext(PipedRDD.scala:161)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at

Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi,

Could you please let me know how to do this? (or) Any suggestion

Regards,
Rajesh

On Mon, Mar 2, 2015 at 4:47 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi,

 I have a below edge list. How to find the parents path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6

 Could you please let me know how to do this? (or) Any suggestion

   Source Vertex Destination Vertex  1 2  2 3  3 4  4 5  5 6
 Regards,
 Rajesh



Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi Robin,

Thank you for your response. Please find my question below. I have the below
edge file:

  Source Vertex   Destination Vertex
  1               2
  2               3
  3               4
  4               5
  5               6
  6               6
In this graph, the 1st vertex is connected to the 2nd vertex, the 2nd vertex is
connected to the 3rd vertex, and so on. The 6th vertex is connected to the 6th
vertex, so the 6th vertex is the root node. Please find the graph below.

[image: Inline image 1]
In this graph, how can I compute the 1st vertex's parents, i.e. 2,3,4,5,6?
Similarly, the 2nd vertex's parents are 3,4,5,6, and the 6th vertex's parent is 6
because it is the root node.

I'm planning to use the Pregel API but I'm not able to define the messages and
the vertex program in that API. Could you please help me with this.

Please let me know if you need more information.

Regards,
Rajesh


On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:

 Rajesh

 I'm not sure if I can help you, however I don't even understand the
 question. Could you restate what you are trying to do.

 Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com
 wrote:

 Hi,

 I have a below edge list. How to find the parents path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6

 Could you please let me know how to do this? (or) Any suggestion

   Source Vertex Destination Vertex  1 2  2 3  3 4  4 5  5 6
 Regards,
 Rajesh




Re: GraphX path traversal

2015-03-03 Thread Madabhattula Rajesh Kumar
Hi,

I have tried the below program using the Pregel API but I'm not able to get my
required output. I'm getting exactly the reverse of the output I'm expecting.

// Creating the graph using the edge file mentioned in the above mail
val graph: Graph[Int, Int] = GraphLoader.edgeListFile(sc,
  "/home/rajesh/Downloads/graphdata/data.csv").cache()

val parentGraph = Pregel(
  graph.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.Out)(
  (id, attr, msg) => (msg ++ attr),
  edge => { if (edge.srcId != edge.dstId)
      Iterator((edge.dstId, (edge.srcAttr + edge.srcId)))
    else Iterator.empty
  },
  (a, b) => (a ++ b))
parentGraph.vertices.collect.foreach(println(_))

*Output :*

(4,Set(1, 2, 3))
(1,Set())
(6,Set(5, 1, 2, 3, 4))
(3,Set(1, 2))
(5,Set(1, 2, 3, 4))
(2,Set(1))

*But I'm looking below output. *

(4,Set(5, 6))
(1,Set(2, 3, 4, 5, 6))
(6,Set())
(3,Set(4, 5, 6))
(5,Set(6))
(2,Set(3, 4, 5, 6))

Could you please point out where I'm going wrong.

Regards,
Rajesh
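
One observation on the program above: with EdgeDirection.Out each vertex accumulates the ids of the vertices upstream of it, which is exactly the mirror image of the wanted output. A minimal sketch (an assumption, not the answer given in the thread) is to run the same Pregel program on the reversed graph, so the accumulated sets become the downstream vertices instead:

// same vertex and edge data, but with every edge direction flipped
val reversed = graph.reverse

val childGraph = Pregel(
  reversed.mapVertices((id, attr) => Set[VertexId]()),
  Set[VertexId](),
  Int.MaxValue,
  EdgeDirection.Out)(
  (id, attr, msg) => msg ++ attr,
  edge =>
    if (edge.srcId != edge.dstId) Iterator((edge.dstId, edge.srcAttr + edge.srcId))
    else Iterator.empty,
  (a, b) => a ++ b)

// expected to print (1,Set(2, 3, 4, 5, 6)) ... (6,Set())
childGraph.vertices.collect.foreach(println)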


On Tue, Mar 3, 2015 at 8:42 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Robin,

 Thank you for your response. Please find below my question. I have a below
 edge file

   Source Vertex Destination Vertex  1 2  2 3  3 4  4 5  5 6  6 6
 In this graph 1st vertex is connected to 2nd vertex, 2nd Vertex is
 connected to 3rd vertex,. 6th vertex is connected to 6th vertex. So 6th
 vertex is a root node. Please find below graph

 [image: Inline image 1]
 In this graph, How can I compute the 1st vertex parents like 2,3,4,5,6.
 Similarly 2nd vertex parents like 3,4,5,6  6th vertex parent like 6
 because this is the root node.

 I'm planning to use pergel API but I'm not able to define messages and
 vertex program in that API. Could you please help me on this.

 Please let me know if you need more information.

 Regards,
 Rajesh


 On Tue, Mar 3, 2015 at 8:15 PM, Robin East robin.e...@xense.co.uk wrote:

 Rajesh

 I'm not sure if I can help you, however I don't even understand the
 question. Could you restate what you are trying to do.

 Sent from my iPhone

 On 2 Mar 2015, at 11:17, Madabhattula Rajesh Kumar mrajaf...@gmail.com
 wrote:

 Hi,

 I have a below edge list. How to find the parents path for every vertex?

 Example :

 Vertex 1 path : 2, 3, 4, 5, 6
 Vertex 2 path : 3, 4, 5, 6
 Vertex 3 path : 4,5,6
 vertex 4 path : 5,6
 vertex 5 path : 6

 Could you please let me know how to do this? (or) Any suggestion

   Source Vertex Destination Vertex  1 2  2 3  3 4  4 5  5 6
 Regards,
 Rajesh





GraphX path traversal

2015-03-02 Thread Madabhattula Rajesh Kumar
Hi,

I have the below edge list. How can I find the parents path for every vertex?

Example :

Vertex 1 path : 2, 3, 4, 5, 6
Vertex 2 path : 3, 4, 5, 6
Vertex 3 path : 4,5,6
vertex 4 path : 5,6
vertex 5 path : 6

Could you please let me know how to do this? (or) Any suggestion

  Source Vertex   Destination Vertex
  1               2
  2               3
  3               4
  4               5
  5               6
Regards,
Rajesh


GraphX vs GraphLab

2015-01-12 Thread Madabhattula Rajesh Kumar
Hi Team,

Has anyone done a comparison (pros and cons) study between GraphX and GraphLab?

Could you please let me know any links for this comparison.

Regards,
Rajesh


Re: when will the spark 1.3.0 be released?

2014-12-17 Thread Madabhattula Rajesh Kumar
Hi All,

When will Spark 1.2.0 be released, and what are the features in Spark
1.2.0?

Regards,
Rajesh

On Wed, Dec 17, 2014 at 11:14 AM, Andrew Ash and...@andrewash.com wrote:

 Releases are roughly every 3mo so you should expect around March if the
 pace stays steady.

 2014-12-16 22:56 GMT-05:00 Marco Shaw marco.s...@gmail.com:

 When it is ready.



  On Dec 16, 2014, at 11:43 PM, 张建轶 zhangjia...@youku.com wrote:
 
  Hi!
 
  when will the spark 1.3.0 be released?
  I want to use new LDA feature.
  Thank you!

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Hi Helena and All,

I have found one example of reading a multi-line json file into an RDD using
https://github.com/alexholmes/json-mapreduce.

val data = sc.newAPIHadoopFile(
  filepath,
  classOf[MultiLineJsonInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf).map(p => (p._1.get, p._2.toString))
data.count

It is expecting a Configuration object. What configuration value do I need to
specify, and how do I specify it?
The MultiLineJsonInputFormat class is expecting a "member" value. How do I
pass the "member" value? Otherwise I'm getting the below exception:

*java.io.IOException: Missing configuration value for multilinejsoninputformat.member
  at com.alexholmes.json.mapreduce.MultiLineJsonInputFormat.createRecordReader(MultiLineJsonInputFormat.java:30)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:115)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
  at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
  at org.apache.spark.scheduler.Task.run(Task.scala:54)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  at java.lang.Thread.run(Thread.java:745)*

Please let me know how to resolve this issue.

Regards,
Rajesh
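
Since the exception names the missing key explicitly, a minimal sketch of the Configuration that newAPIHadoopFile needs would look like the following (the member name "NAME" is only a guess based on the JSON quoted below, not something confirmed in this thread):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// tell MultiLineJsonInputFormat which JSON member marks one record
conf.set("multilinejsoninputformat.member", "NAME")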


On Sun, Dec 14, 2014 at 7:21 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Thank you Yanbo

 Regards,
 Rajesh

 On Sun, Dec 14, 2014 at 3:15 PM, Yanbo yanboha...@gmail.com wrote:

 Pay attention to your JSON file, try to change it like following.
 Each record represent as a JSON string.

  {"NAME" : "Device 1",
   "GROUP" : "1",
   "SITE" : "qqq",
   "DIRECTION" : "East",
  }
  {"NAME" : "Device 2",
   "GROUP" : "2",
   "SITE" : "sss",
   "DIRECTION" : "North",
  }

  在 2014年12月14日,下午5:01,Madabhattula Rajesh Kumar mrajaf...@gmail.com
 写道:
 
  { Device 1 :
   {NAME : Device 1,
GROUP : 1,
SITE : qqq,
DIRECTION : East,
   }
   Device 2 :
   {NAME : Device 2,
GROUP : 2,
SITE : sss,
DIRECTION : North,
   }
  }




Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Thank you Peter for the clarification.

Regards,
Rajesh

On Tue, Dec 16, 2014 at 12:42 AM, Michael Armbrust mich...@databricks.com
wrote:

 Underneath the covers, jsonFile uses TextInputFormat, which will split
 files correctly based on new lines.  Thus, there is no fixed maximum size
 for a json object (other than the fact that it must fit into memory on the
 executors).

 On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Peter,

 Thank you for the clarification.

  Now we need to store each JSON object on one line. Is there any
  limitation on the length of a JSON object, so that the JSON object will not go
  to the next line?

 What will happen if JSON object is a big/huge one?  Will it store in a
 single line in HDFS?

 What will happen, if JSON object contains BLOB/CLOB value? Is this entire
 JSON object stores in single line of HDFS?

 What will happen, if JSON object exceeding the HDFS block size. For
 example, single JSON object split into two different worker nodes. In this
 case, How Spark will read this JSON object?

 Could you please clarify above questions

 Regards,
 Rajesh


 On Mon, Dec 15, 2014 at 6:52 PM, Peter Vandenabeele 
 pe...@vandenabeele.com wrote:



 On Sat, Dec 13, 2014 at 5:43 PM, Helena Edelson 
 helena.edel...@datastax.com wrote:

 One solution can be found here:
 https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets


 As far as I understand, the people.json file is not really a proper json
 file, but a file documented as:

   ... JSON files where each line of the files is a JSON object..

 This means that is a file with multiple lines, but each line needs to
 have a fully self-contained JSON object
 (initially confusing, this will not parse a standard multi-line JSON
 file). We are working to clarify this in
 https://github.com/apache/spark/pull/3517

 HTH,

 Peter




 - Helena
 @helenaedelson

 On Dec 13, 2014, at 11:18 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Team,

 I have a large JSON file in Hadoop. Could you please let me know

 1. How to read the JSON file
 2. How to parse the JSON file

 Please share any example program based on Scala

 Regards,
 Rajesh





 --
 Peter Vandenabeele
 http://www.allthingsdata.io
 http://www.linkedin.com/in/petervandenabeele
 https://twitter.com/peter_v
 gsm: +32-478-27.40.69
 e-mail: pe...@vandenabeele.com
 skype: peter_v_be




Re: JSON Input files

2014-12-14 Thread Madabhattula Rajesh Kumar
Hi Helena and All,

I have the below example JSON file format. My use case is to read the NAME
field.

When I execute it, I get the following exception:

*Exception in thread "main"
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Unresolved
attributes: 'NAME, tree:
Project ['NAME]
 Subquery device*

*Please let me know how to read values from JSON using Spark SQL*

*CODE BLOCK :*

val device = sqlContext.jsonFile("hdfs://localhost:9000/user/rajesh/json/test.json")
device.registerAsTable("device")
device.printSchema
val results = sqlContext.sql("SELECT NAME FROM device").collect.foreach(println)

*JSON format :*

{ "Device 1" :
 {"NAME" : "Device 1",
  "GROUP" : "1",
  "SITE" : "qqq",
  "DIRECTION" : "East",
 }
 "Device 2" :
 {"NAME" : "Device 2",
  "GROUP" : "2",
  "SITE" : "sss",
  "DIRECTION" : "North",
 }
}



On Sat, Dec 13, 2014 at 10:13 PM, Helena Edelson 
helena.edel...@datastax.com wrote:

 One solution can be found here:
 https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets

 - Helena
 @helenaedelson

 On Dec 13, 2014, at 11:18 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Team,

 I have a large JSON file in Hadoop. Could you please let me know

 1. How to read the JSON file
 2. How to parse the JSON file

 Please share any example program based on Scala

 Regards,
 Rajesh





Re: JSON Input files

2014-12-14 Thread Madabhattula Rajesh Kumar
Thank you Yanbo

Regards,
Rajesh

On Sun, Dec 14, 2014 at 3:15 PM, Yanbo yanboha...@gmail.com wrote:

 Pay attention to your JSON file, try to change it like following.
 Each record represent as a JSON string.

  {"NAME" : "Device 1",
   "GROUP" : "1",
   "SITE" : "qqq",
   "DIRECTION" : "East",
  }
  {"NAME" : "Device 2",
   "GROUP" : "2",
   "SITE" : "sss",
   "DIRECTION" : "North",
  }

  在 2014年12月14日,下午5:01,Madabhattula Rajesh Kumar mrajaf...@gmail.com 写道:
 
  { Device 1 :
   {NAME : Device 1,
GROUP : 1,
SITE : qqq,
DIRECTION : East,
   }
   Device 2 :
   {NAME : Device 2,
GROUP : 2,
SITE : sss,
DIRECTION : North,
   }
  }



JSON Input files

2014-12-13 Thread Madabhattula Rajesh Kumar
Hi Team,

I have a large JSON file in Hadoop. Could you please let me know

1. How to read the JSON file
2. How to parse the JSON file

Please share any example program based on Scala

Regards,
Rajesh


Re: Spark and Stanford CoreNLP

2014-11-24 Thread Madabhattula Rajesh Kumar
Hello,

I'm new to Stanford CoreNLP. Could anyone share good training material and
examples (Java or Scala) on NLP?

Regards,
Rajesh

On Mon, Nov 24, 2014 at 9:38 PM, Ian O'Connell i...@ianoconnell.com wrote:


 object MyCoreNLP {
   @transient lazy val coreNLP = new coreNLP()
 }

 and then refer to it from your map/reduce/map partitions or that it should
 be fine (presuming its thread safe), it will only be initialized once per
 classloader per jvm

 On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks evan.spa...@gmail.com
 wrote:

 We have gotten this to work, but it requires instantiating the CoreNLP
 object on the worker side. Because of the initialization time it makes a
 lot of sense to do this inside of a .mapPartitions instead of a .map, for
 example.

 As an aside, if you're using it from Scala, have a look at sistanlp,
 which provided a nicer, scala-friendly interface to CoreNLP.


  On Nov 24, 2014, at 7:46 AM, tvas theodoros.vasilou...@gmail.com
 wrote:
 
  Hello,
 
  I was wondering if anyone has gotten the Stanford CoreNLP Java library
 to
  work with Spark.
 
  My attempts to use the parser/annotator fail because of task
 serialization
  errors since the class
  StanfordCoreNLP cannot be serialized.
 
  I've tried the remedies of registering StanfordCoreNLP through kryo, as
 well
  as using chill.MeatLocker,
  but these still produce serialization errors.
  Passing the StanfordCoreNLP object as transient leads to a
  NullPointerException instead.
 
  Has anybody managed to get this work?
 
  Regards,
  Theodore
 
 
 
  --
  View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Stanford-CoreNLP-tp19654.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





To find distances to reachable source vertices using GraphX

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All,

I'm trying to understand the example program at the link below. When I run this
program, I'm getting a *java.lang.NullPointerException* at the
highlighted line below.

*https://gist.github.com/ankurdave/4a17596669b36be06100
https://gist.github.com/ankurdave/4a17596669b36be06100*

val updatedDists = edge.srcAttr.filter {
  case (source, dist) =>
    *val existingDist = edge.dstAttr.getOrElse(source, Int.MaxValue)*
    existingDist > dist + 1
}.mapValues(_ + 1).map(identity)

Could you please help me to resolve this issue.

Regards,
Rajesh


GraphX : Vertices details in Triangles

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All,

I'm new to GraphX. I'm exploring Triangle Count use cases. I'm able to
get the number of triangles in a graph but I'm not able to collect the vertex
details for each triangle.

*For example* : I'm playing with the vertices and edges from one of the GraphX
example graphs.

val vertexArray = Array(
  (1L, ("Alice", 28)),
  (2L, ("Bob", 27)),
  (3L, ("Charlie", 65)),
  (4L, ("David", 42)),
  (5L, ("Ed", 55)),
  (6L, ("Fran", 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

[image: Toy Social Network]

I'm able to get a triangle count of 3 for the above graph. I want to know the
vertex details for each triangle.

*For example like* :

Triangle 1 :  1, 2, 4
Triangle 2 :  2, 5, 3
Triangle 3 :  5, 3, 6

Could you please help on this.

Regards,
Rajesh
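
TriangleCount only reports per-vertex counts, so a minimal sketch of one way to list the actual vertex ids (an assumption, not an out-of-the-box API) is to attach each vertex's neighbour ids and then intersect the neighbour sets of the two endpoints of every edge:

val graph = Graph(sc.parallelize(vertexArray), sc.parallelize(edgeArray))

// every common neighbour of an edge's endpoints closes a triangle with that edge
val neighbors = graph.collectNeighborIds(EdgeDirection.Either)
val withNbrs = graph.outerJoinVertices(neighbors)((_, _, n) => n.getOrElse(Array[VertexId]()))
val triangles = withNbrs.triplets.flatMap { t =>
  t.srcAttr.toSet.intersect(t.dstAttr.toSet).map(third => Set(t.srcId, t.dstId, third))
}.distinct()

// for the toy graph above this should yield Set(1, 2, 4), Set(2, 3, 5) and Set(3, 5, 6)
triangles.collect().foreach(println)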


Fwd: GraphX : Vertices details in Triangles

2014-11-03 Thread Madabhattula Rajesh Kumar
Hi All,

I'm new to GraphX. I'm understanding Triangle Count use cases. I'm able to
get number of triangles in a graph but I'm not able to collect vertices
details in each Triangle.

*For example* : I'm playing one of the graphx graph example Vertices and
Edges

val vertexArray = Array(
  (1L, (Alice, 28)),
  (2L, (Bob, 27)),
  (3L, (Charlie, 65)),
  (4L, (David, 42)),
  (5L, (Ed, 55)),
  (6L, (Fran, 50))
  )
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3)
  )

[image: Toy Social Network]

I'm able to get 3 Triangles count in the above graph. I want to know
Vertices details in a each Triangle.

*For example like* :

Triangle 1 :  1, 2, 4
Triangle 2 :  2, 5, 3
Triangle 3 :  5, 3, 6

Could you please help on this.

Regards,
Rajesh


Re: To find distances to reachable source vertices using GraphX

2014-11-03 Thread Madabhattula Rajesh Kumar
Thank you Ankur for your help and support!!!

On Tue, Nov 4, 2014 at 5:24 AM, Ankur Dave ankurd...@gmail.com wrote:

 The NullPointerException seems to be because edge.dstAttr is null, which
 might be due to SPARK-3936
 https://issues.apache.org/jira/browse/SPARK-3936. Until that's fixed, I
 edited the Gist with a workaround. Does that fix the problem?

 Ankur http://www.ankurdave.com/

 On Mon, Nov 3, 2014 at 12:23 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi All,

 I'm trying to understand below link example program. When I run this
 program, I'm getting *java.lang.NullPointerException* at below
 highlighted line.

 *https://gist.github.com/ankurdave/4a17596669b36be06100
 https://gist.github.com/ankurdave/4a17596669b36be06100*

 val updatedDists = edge.srcAttr.filter {
 case (source, dist) =
 *val existingDist = edge.dstAttr.getOrElse(source, Int.MaxValue) *
 existingDist  dist + 1
 }.mapValues(_ + 1).map(identity)

 Could you please help me to resolve this issue.

 Regards,
 Rajesh





Re: Spark Streaming + Actors

2014-09-26 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please respond on my below request.

Regards,
Rajesh



On Thu, Sep 25, 2014 at 11:38 PM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Team,

 Can I use Actors in Spark Streaming based on events type? Could you please
 review below Test program and let me know if any thing I need to change
 with respect to best practices

 import akka.actor.Actor
 import akka.actor.{ActorRef, Props}
 import org.apache.spark.SparkConf
 import org.apache.spark.streaming.StreamingContext
 import org.apache.spark.streaming.Seconds
 import akka.actor.ActorSystem

 case class one(r: org.apache.spark.rdd.RDD[String])
 case class two(s: org.apache.spark.rdd.RDD[String])

 class Events extends Actor
 {
   def receive = {
 // Based on event type - Invoke respective methods asynchronously
 case one(r) = println(ONE COUNT + r.count) // Invoke respective
 functions
 case two(s) = println(TWO COUNT + s.count) // Invoke respective
 functions
   }
 }

 object Test {

 def main(args: Array[String]) {
 val system = ActorSystem(System)
 val event: ActorRef = system.actorOf(Props[Events], events)
 val sparkConf = new SparkConf() setAppName(AlertsLinesCount)
 setMaster(local)
 val ssc = new StreamingContext(sparkConf, Seconds(30))
 val lines = ssc
 textFileStream(hdfs://localhost:9000/user/rajesh/EventsDirectory/)
 lines foreachRDD(x = {
   event ! one(x)
   event ! two(x)
 })
 ssc.start
 ssc.awaitTermination
 }
 }

 Regards,
 Rajesh



Spark Streaming + Actors

2014-09-25 Thread Madabhattula Rajesh Kumar
Hi Team,

Can I use actors in Spark Streaming based on event type? Could you please
review the below Test program and let me know if there is anything I need to change
with respect to best practices.

import akka.actor.Actor
import akka.actor.{ActorRef, Props}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import akka.actor.ActorSystem

case class one(r: org.apache.spark.rdd.RDD[String])
case class two(s: org.apache.spark.rdd.RDD[String])

class Events extends Actor
{
  def receive = {
// Based on event type - Invoke respective methods asynchronously
case one(r) => println("ONE COUNT " + r.count) // Invoke respective functions
case two(s) => println("TWO COUNT " + s.count) // Invoke respective functions
  }
}

object Test {

def main(args: Array[String]) {
val system = ActorSystem("System")
val event: ActorRef = system.actorOf(Props[Events], "events")
val sparkConf = new SparkConf() setAppName("AlertsLinesCount") setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(30))
val lines = ssc textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/")
lines foreachRDD(x => {
  event ! one(x)
  event ! two(x)
})
ssc.start
ssc.awaitTermination
}
}

Regards,
Rajesh


Spark : java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team,

I'm getting the below exception. Could you please help me to resolve this issue?

Below is my piece of code

val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])

var s = rdd.map(x => x._2)

var a = test(s.collect)

def test(s:Array[org.apache.hadoop.hbase.client.Result])
{

  var invRecords = HashMap[String, HashMap[String, String]]()
  s foreach (x => {
    var invValues = HashMap[String, String]()
    x rawCells () foreach (y => { invValues += Bytes.toString(y getQualifier) -> Bytes.toString(y getValue) })
    invRecords += Bytes.toString(x getRow) -> invValues
  })
  println("* === " + invRecords.size)

}

*java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result*
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1173)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at
org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:197)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:701)
2014-09-24 20:57:29,703 WARN  [Result resolver thread-0]
scheduler.TaskSetManager (Logging.scala:logWarning(70)) - Lost TID 0 (task
0.0:0)
2014-09-24 20:57:29,717 ERROR [Result resolver thread-0]
scheduler.TaskSetManager (Logging.scala:logError(74)) - Task 0.0:0 had a
not serializable result: java.io.NotSerializableException:
org.apache.hadoop.hbase.client.Result; not retrying
2014-09-24 20:57:29,722 INFO  [main] scheduler.DAGScheduler
(Logging.scala:logInfo(58)) - Failed to run collect at HBaseTest5.scala:26
*Exception in thread main org.apache.spark.SparkException: Job aborted
due to stage failure: Task 0.0:0 had a not serializable result:
java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result*
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
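
Since org.apache.hadoop.hbase.client.Result is not serializable, it cannot be shipped to the driver by collect. A minimal sketch of one common rearrangement (an assumption, not taken from this thread) is to unpack each Result into plain Scala values inside the map, so that only serializable data crosses the wire:

import org.apache.hadoop.hbase.CellUtil
import org.apache.hadoop.hbase.util.Bytes

val invRecords = rdd.map { case (_, result) =>
  // runs on the executors: extract the row key and qualifier -> value pairs
  val rowKey = Bytes.toString(result.getRow)
  val cells = result.rawCells().map { cell =>
    Bytes.toString(CellUtil.cloneQualifier(cell)) -> Bytes.toString(CellUtil.cloneValue(cell))
  }.toMap
  rowKey -> cells
}.collect().toMap

println("* === " + invRecords.size)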


Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please point me the example program for Spark HBase to read
columns and values

Regards,
Rajesh


Re: Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Ted,

Thank you for the quick response and details. I have verified the HBaseTest.scala
program; it is returning an hbaseRDD but I'm not able to retrieve values from
the hbaseRDD.

Could you help me to retrieve the values from hbaseRDD

Regards,
Rajesh


On Wed, Sep 24, 2014 at 10:12 PM, Ted Yu yuzhih...@gmail.com wrote:

 Take a look at the following under examples:

 examples/src//main/python/hbase_inputformat.py
 examples/src//main/python/hbase_outputformat.py
 examples/src//main/scala/org/apache/spark/examples/HBaseTest.scala

 examples/src//main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala

 Cheers

 On Wed, Sep 24, 2014 at 9:39 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Team,

 Could you please point me the example program for Spark HBase to read
 columns and values

 Regards,
 Rajesh





Re: Iterate over ArrayBuffer

2014-09-04 Thread Madabhattula Rajesh Kumar
Hi Deep,

If your requirement is to read the values from the ArrayBuffer, use the below code:

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer

scala> var a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)

scala> for(b <- a)
     | println(b)
5
3
1
4

scala>


Regards,
Rajesh


On Thu, Sep 4, 2014 at 1:34 PM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have the following ArrayBuffer
 *ArrayBuffer(5,3,1,4)*
 Now, I want to iterate over the ArrayBuffer.
 What is the way to do it?

 Thank You



Re: Number of elements in ArrayBuffer

2014-09-02 Thread Madabhattula Rajesh Kumar
Hi Deep,

Please find below results of ArrayBuffer in scala REPL

scala> import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.ArrayBuffer

scala> val a = ArrayBuffer(5,3,1,4)
a: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(5, 3, 1, 4)

scala> a.head
res2: Int = 5

scala> a.tail
res3: scala.collection.mutable.ArrayBuffer[Int] = ArrayBuffer(3, 1, 4)

scala> a.length
res4: Int = 4

Regards,
Rajesh


On Wed, Sep 3, 2014 at 10:13 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:

 Hi,
 I have the following ArrayBuffer:

 *ArrayBuffer(5,3,1,4)*

 Now, I want to get the number of elements in this ArrayBuffer and also the
 first element of the ArrayBuffer. I used .length and .size but they are
 returning 1 instead of 4.
 I also used .head and .last for getting the first and the last element but
 they also return the entire ArrayBuffer (ArrayBuffer(5,3,1,4))
 What I understand from this is that, the entire ArrayBuffer is stored as
 one element.

 How should I go about doing the required things?

 Thank You




Re: spark sql

2014-08-02 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please help me to resolve above compilation issue.

Regards,
Rajesh


On Sat, Aug 2, 2014 at 2:02 AM, Madabhattula Rajesh Kumar 
mrajaf...@gmail.com wrote:

 Hi Team,

 I'm not able to print the values from a Spark SQL JavaSchemaRDD. Please find
 my code below:

 JavaSQLContext sqlCtx = new JavaSQLContext(sc);

 NewHadoopRDD<ImmutableBytesWritable, Result> rdd = new
 NewHadoopRDD<ImmutableBytesWritable, Result>(
 JavaSparkContext.toSparkContext(sc),
 TableInputFormat.class, ImmutableBytesWritable.class,
 Result.class, conf);

 JavaRDD<Tuple2<ImmutableBytesWritable, Result>> jrdd = rdd
 .toJavaRDD();

 ForEachFunction f = new ForEachFunction();

 JavaRDD<ANAInventory> retrdd = jrdd.map(f);

 JavaSchemaRDD schemaPeople = sqlCtx.applySchema(retrdd,
 Test.class);
 schemaPeople.registerAsTable("retrdd");

 JavaSchemaRDD teenagers = sqlCtx.sql("SELECT * FROM retrdd");

 When I add the below code, it gives a compilation issue. Could you please
 help me to resolve this issue.


 List<String> teenagerNames = teenagers.map(new Function<Row, String>() {
   public String call(Row row) {
     return null;
   }
 }).collect();
 for (String name: teenagerNames) {
   System.out.println(name);
 }


 Compilation issue :

 The method map(Function<Row,R>) in the type JavaSchemaRDD is not
 applicable for the arguments (new Function<Row, String>(){})


 Thank you for your help

 Regards,
 Rajesh




Re: Hbase

2014-08-01 Thread Madabhattula Rajesh Kumar
Hi Akhil,

Thank you very much for your help and support.

Regards,
Rajesh


On Fri, Aug 1, 2014 at 7:57 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Here's a piece of code. In your case, you are missing the call() method
 inside the map function.


 import java.util.Iterator;

 import java.util.List;

 import org.apache.commons.configuration.Configuration;

 import org.apache.hadoop.hbase.HBaseConfiguration;

 import org.apache.hadoop.hbase.KeyValue;

 import org.apache.hadoop.hbase.client.Get;

 import org.apache.hadoop.hbase.client.HTable;

 import org.apache.hadoop.hbase.client.Result;

 import org.apache.hadoop.hbase.util.Bytes;

 import org.apache.spark.SparkConf;

 import org.apache.spark.SparkContext;

 import org.apache.spark.api.java.JavaRDD;

 import org.apache.spark.api.java.JavaSparkContext;

 import org.apache.spark.api.java.function.Function;

 import org.apache.spark.rdd.NewHadoopRDD;

 import org.apache.spark.streaming.Duration;

 import org.apache.spark.streaming.api.java.JavaStreamingContext;

 import org.apache.hadoop.hbase.io.ImmutableBytesWritable;

 import org.apache.hadoop.hbase.mapreduce.TableInputFormat;

 import com.google.common.collect.Lists;

 import scala.Function1;

 import scala.Tuple2;

 import scala.collection.JavaConversions;

 import scala.collection.Seq;

 import scala.collection.JavaConverters.*;

 import scala.reflect.ClassTag;

 public class SparkHBaseMain {

   @SuppressWarnings(deprecation)

  public static void main(String[] arg){

   try{

   ListString jars =
 Lists.newArrayList(/home/akhld/Desktop/tools/spark-9/jars/spark-assembly-0.9.0-incubating-hadoop2.3.0-mr1-cdh5.0.0.jar,

  /home/akhld/Downloads/sparkhbasecode/hbase-server-0.96.0-hadoop2.jar,

  /home/akhld/Downloads/sparkhbasecode/hbase-protocol-0.96.0-hadoop2.jar,


 /home/akhld/Downloads/sparkhbasecode/hbase-hadoop2-compat-0.96.0-hadoop2.jar,

  /home/akhld/Downloads/sparkhbasecode/hbase-common-0.96.0-hadoop2.jar,

  /home/akhld/Downloads/sparkhbasecode/hbase-client-0.96.0-hadoop2.jar,

  /home/akhld/Downloads/sparkhbasecode/htrace-core-2.02.jar);

  SparkConf spconf = new SparkConf();

  spconf.setMaster(local);

  spconf.setAppName(SparkHBase);

  spconf.setSparkHome(/home/akhld/Desktop/tools/spark-9);

  spconf.setJars(jars.toArray(new String[jars.size()]));

  spconf.set(spark.executor.memory, 1g);

  final JavaSparkContext sc = new JavaSparkContext(spconf);

   org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();

  conf.addResource(/home/akhld/Downloads/sparkhbasecode/hbase-site.xml);

  conf.set(TableInputFormat.INPUT_TABLE, blogposts);

NewHadoopRDDImmutableBytesWritable, Result rdd = new
 NewHadoopRDDImmutableBytesWritable,
 Result(JavaSparkContext.toSparkContext(sc), TableInputFormat.class,
 ImmutableBytesWritable.class, Result.class, conf);

   JavaRDDTuple2ImmutableBytesWritable, Result jrdd = rdd.toJavaRDD();

   *ForEachFunction f = new ForEachFunction();*

 * JavaRDDIteratorString retrdd = jrdd.map(f);*



 System.out.println(Count = + retrdd.count());

   }catch(Exception e){

   e.printStackTrace();

  System.out.println(Crshed :  + e);

   }

   }

   @SuppressWarnings(serial)

 private static class ForEachFunction extends
 FunctionTuple2ImmutableBytesWritable, Result, IteratorString{

 *public IteratorString call(Tuple2ImmutableBytesWritable,
 Result test) {*

 *Result tmp = (Result) test._2;*

 * ListKeyValue kvl = tmp.getColumn(post.getBytes(),
 title.getBytes());*

 * for(KeyValue kl:kvl){*

 * String sb = new String(kl.getValue());*

 * System.out.println(Value : + sb);*

 * }*

 *return null;*

 *}*

  }


  }


 Hope it helps.


 Thanks
 Best Regards


 On Fri, Aug 1, 2014 at 4:44 PM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Akhil,

 Thank you for your response. I'm facing below issues.

 I'm not able to print the values. Am I missing any thing. Could you
 please look into this issue.

 JavaPairRDDImmutableBytesWritable, Result hBaseRDD =
 sc.newAPIHadoopRDD(
 conf,
 TableInputFormat.class,
 ImmutableBytesWritable.class,
 Result.class);

 System.out.println( ROWS COUNT = + hBaseRDD.count());

   JavaRDD R = hBaseRDD.map(new FunctionTuple2ImmutableBytesWritable,
 Result, IteratorString(){

 public IteratorString call(Tuple2ImmutableBytesWritable,
 Result test)
 {
 Result tmp = (Result) test._2;

 System.out.println(Inside );

 //ListKeyValue kvl = tmp.getColumn(post.getBytes(),
 title.getBytes());
 for(KeyValue kl:tmp.raw())

 {
 String sb = new String(kl.getValue());
 System.out.println(sb);
 }
 return null;
 }
 }
 );

 *Output :*

 ROWS COUNT = 8

 It is not printing Inside statement also. I think it is not going into
 this function.

 Could you please help me on this issue.

 Thank you for your

javasparksql Hbase

2014-07-28 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please point me to an example program/link for Java Spark SQL to join
2 HBase tables?

Regards,
Rajesh


spark checkpoint details

2014-07-27 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please help me with the below query.

I'm using JavaStreamingContext to read streaming files from an HDFS shared
directory. When I start the Spark Streaming job, it reads files from the HDFS
shared directory and does some processing. When I stop and restart the job, it
reads the old files again. Is there any way to maintain the checkpointed file
information in Spark?
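
A minimal sketch (written in Scala for brevity, and an assumption about one way to approach it rather than this thread's answer) of using a checkpoint directory with StreamingContext.getOrCreate, so that a restarted job recovers its streaming state from the checkpoint instead of starting from scratch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://localhost:9000/user/rajesh/checkpoint"   // assumed path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("FileStreamJob")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  val lines = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/")
  lines.foreachRDD(rdd => println(rdd.count()))   // placeholder processing
  ssc
}

// on restart, the context is rebuilt from the checkpoint rather than recreated
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()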


Re: Need help on spark Hbase

2014-07-16 Thread Madabhattula Rajesh Kumar
 badwclient.jar
 This worked for us.
 Cheers
 k/


 On Tue, Jul 15, 2014 at 6:47 AM, Madabhattula Rajesh Kumar 
 mrajaf...@gmail.com wrote:

 Hi Team,

 Could you please help me to resolve the issue.

 *Issue *: I'm not able to connect HBase from Spark-submit. Below is
 my code.  When i execute below program in standalone, i'm able to connect
 to Hbase and doing the operation.

 When i execute below program using spark submit ( ./bin/spark-submit
 ) command, i'm not able to connect to hbase. Am i missing any thing?


 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
 import java.util.Properties;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.hbase.HBaseConfiguration;
 import org.apache.hadoop.hbase.client.Put;
 import org.apache.log4j.Logger;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaRDD;
 import org.apache.spark.api.java.function.Function;
 import org.apache.spark.streaming.Duration;
 import org.apache.spark.streaming.api.java.JavaDStream;
 import org.apache.spark.streaming.api.java.JavaStreamingContext;
 import org.apache.hadoop.hbase.HTableDescriptor;
 import org.apache.hadoop.hbase.client.HBaseAdmin;

 public class Test {


 public static void main(String[] args) throws Exception {

 JavaStreamingContext ssc = new
 JavaStreamingContext("local", "Test", new Duration(4), sparkHome, "");

 JavaDStream<String> lines_2 =
 ssc.textFileStream(hdfsfolderpath);

 Configuration configuration = HBaseConfiguration.create();
 configuration.set("hbase.zookeeper.property.clientPort",
 "2181");
 configuration.set("hbase.zookeeper.quorum", "localhost");
 configuration.set("hbase.master", "localhost:60");

 HBaseAdmin hBaseAdmin = new HBaseAdmin(configuration);

 if (hBaseAdmin.tableExists("HABSE_TABLE")) {
 System.out.println(" ANA_DATA table exists ..");
 }

 System.out.println(" HELLO HELLO HELLO ");

 ssc.start();
 ssc.awaitTermination();

 }
 }

 Thank you for your help and support.

 Regards,
 Rajesh









Re: Possible bug in Spark Streaming :: TextFileStream

2014-07-14 Thread Madabhattula Rajesh Kumar
Hi Team,

Is this issue present with the JavaStreamingContext.textFileStream(hdfsfolderpath)
API also? Please confirm. If yes, could you please help me to fix this
issue. I'm using the Spark 1.0.0 version.

Regards,
Rajesh


On Tue, Jul 15, 2014 at 5:42 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:

 Oh yes, this was a bug and it has been fixed. Checkout from the master
 branch!


 https://issues.apache.org/jira/browse/SPARK-2362?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20Streaming%20ORDER%20BY%20created%20DESC%2C%20priority%20ASC

 TD


 On Mon, Jul 7, 2014 at 7:11 AM, Luis Ángel Vicente Sánchez 
 langel.gro...@gmail.com wrote:

 I have a basic spark streaming job that is watching a folder, processing
 any new file and updating a column family in cassandra using the new
 cassandra-spark-driver.

 I think there is a problem with SparkStreamingContext.textFileStream...
 if I start my job in local mode with no files in the folder that is watched
 and then I copy a bunch of files, sometimes spark is continually processing
 those files again and again.

 I have noticed that it usually happens when spark doesn't detect all new
 files in one go... i.e. I copied 6 files and spark detected 3 of them as
 new and processed them; then it detected the other 3 as new and processed
 them. After it finished to process all 6 files, it detected again the first
 3 files as new files and processed them... then the other 3... and again...
 and again... and again.

 Should I rise a JIRA issue?

 Regards,

 Luis





how to convert JavaDStream<String> to JavaRDD<String>

2014-07-09 Thread Madabhattula Rajesh Kumar
Hi Team,

Could you please help me to resolve below query.

My use case is :

I'm using JavaStreamingContext to read text files from Hadoop - HDFS
directory

JavaDStream<String> lines_2 =
ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/");

How can I convert the JavaDStream<String> result to a JavaRDD<String>, if that is
possible? Then I can use the collect() method on the JavaRDD and process my text file.

I'm not able to find a collect method on JavaRDD<String>.

Thank you very much in advance.

Regards,
Rajesh
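
The usual pattern (sketched in Scala for brevity; JavaDStream exposes the same foreachRDD method) is not to convert the whole DStream into a single RDD, but to let the stream hand over one RDD per batch interval inside foreachRDD, where collect() and any other RDD processing are available:

val lines = ssc.textFileStream("hdfs://localhost:9000/user/rajesh/EventsDirectory/")

lines.foreachRDD { rdd =>
  // rdd is an ordinary RDD[String] holding the files that arrived in this batch
  rdd.collect().foreach(println)
}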