Re: Vulnerabilities in htrace-core4-4.1.0-incubating.jar used in Spark.

2022-05-01 Thread HARSH TAKKAR
We scanned three versions of Spark: 3.0.0, 3.1.3, and 3.2.1.



On Tue, 26 Apr, 2022, 18:46 Bjørn Jørgensen, 
wrote:

> What version of spark is it that you have scanned?
>
>
>
> On Tue, 26 Apr 2022 at 12:48, HARSH TAKKAR wrote:
>
>> Hello,
>>
>> Please let me know if there is a fix available for following
>> vulnerabilities in htrace jar used in spark jars folder.
>>
>> LIBRARY: com.fasterxml.jackson.core:jackson-databind
>>
>> VULNERABILITY IDs:
>>
>>   CVE-2020-9548
>>   CVE-2020-9547
>>   CVE-2020-8840
>>   CVE-2020-36179
>>   CVE-2020-35491
>>   CVE-2020-35490
>>   CVE-2020-25649
>>   CVE-2020-24750
>>   CVE-2020-24616
>>   CVE-2020-10673
>>   CVE-2019-20330
>>   CVE-2019-17531
>>   CVE-2019-17267
>>   CVE-2019-16943
>>   CVE-2019-16942
>>   CVE-2019-16335
>>   CVE-2019-14893
>>   CVE-2019-14892
>>   CVE-2019-14540
>>   CVE-2019-14439
>>   CVE-2019-14379
>>   CVE-2019-12086
>>   CVE-2018-7489
>>   CVE-2018-5968
>>   CVE-2018-14719
>>   CVE-2018-14718
>>   CVE-2018-12022
>>   CVE-2018-11307
>>   CVE-2017-7525
>>   CVE-2017-17485
>>   CVE-2017-15095
>>
>> Kind Regards
>>
>> Harsh Takkar
>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Vulnerabilities in htrace-core4-4.1.0-incubating.jar used in Spark.

2022-04-26 Thread HARSH TAKKAR
Hello,

Please let me know if a fix is available for the following vulnerabilities in
the htrace jar used in the Spark jars folder.

LIBRARY: com.fasterxml.jackson.core:jackson-databind

VULNERABILITY IDs:

  CVE-2020-9548
  CVE-2020-9547
  CVE-2020-8840
  CVE-2020-36179
  CVE-2020-35491
  CVE-2020-35490
  CVE-2020-25649
  CVE-2020-24750
  CVE-2020-24616
  CVE-2020-10673
  CVE-2019-20330
  CVE-2019-17531
  CVE-2019-17267
  CVE-2019-16943
  CVE-2019-16942
  CVE-2019-16335
  CVE-2019-14893
  CVE-2019-14892
  CVE-2019-14540
  CVE-2019-14439
  CVE-2019-14379
  CVE-2019-12086
  CVE-2018-7489
  CVE-2018-5968
  CVE-2018-14719
  CVE-2018-14718
  CVE-2018-12022
  CVE-2018-11307
  CVE-2017-7525
  CVE-2017-17485
  CVE-2017-15095


Kind Regards

Harsh Takkar


Unsubscribe

2021-11-17 Thread HARSH TAKKAR
Unsubscribe


Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread HARSH TAKKAR
Hello Sean,

Thanks for the advice. Could you please point me to an example of a custom
Python wrapper?
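
For reference, a rough sketch of the wrapper pattern that PySpark's own ML
classes use; the Scala class name "com.example.ml.MyEstimator" is a
placeholder, not an actual class from this thread:

from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaEstimator, JavaModel

# Rough sketch of a thin Python wrapper around a custom Scala Estimator.
class MyEstimatorModel(JavaModel, JavaMLReadable, JavaMLWritable):
    """Python handle for the fitted Scala model."""

class MyEstimator(JavaEstimator, JavaMLReadable, JavaMLWritable):
    def __init__(self):
        super(MyEstimator, self).__init__()
        # Instantiate the underlying Scala estimator through the JVM gateway.
        self._java_obj = self._new_java_obj("com.example.ml.MyEstimator", self.uid)

    def _create_model(self, java_model):
        # Wrap the fitted Scala model in its Python counterpart.
        return MyEstimatorModel(java_model)

When a pipeline saved from Scala is loaded in PySpark, the loader tries to
import a Python class matching the JVM class name, which is probably where the
"module not found" error in the original question comes from.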


Kind Regards
Harsh Takkar

On Tue, 16 Feb, 2021, 8:25 pm Sean Owen,  wrote:

> You won't be able to use it in python if it is implemented in Java - needs
> a python wrapper too.
>
> On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR  wrote:
>
>> Hi ,
>>
>> I have created a custom Estimator in scala, which i can use successfully
>> by creating a pipeline model in Java and scala, But when i try to load the
>> pipeline model saved using scala api in pyspark, i am getting an error
>> saying module not found.
>>
>> I have included my custom model jar in the class pass using "spark.jars"
>>
>> Can you please help, if i am missing something.
>>
>> Kind Regards
>> Harsh Takkar
>>
>


Using Custom Scala Spark ML Estimator in PySpark

2021-02-15 Thread HARSH TAKKAR
Hi ,

I have created a custom Estimator in Scala, which I can use successfully by
creating a pipeline model in Java and Scala. However, when I try to load the
pipeline model saved using the Scala API in PySpark, I get an error saying the
module was not found.

I have included my custom model jar on the classpath using "spark.jars".

Can you please help if I am missing something?

Kind Regards
Harsh Takkar


Re: How to enable hive support on an existing Spark session?

2020-05-27 Thread HARSH TAKKAR
Hi Kun,

You can set the following Spark property while launching the app instead of
enabling it manually in the code:

spark.sql.catalogImplementation=hive
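
For illustration, a minimal PySpark sketch of setting this at session
construction time (the app name is just a placeholder); the same property can
also be passed to spark-submit with --conf:

from pyspark.sql import SparkSession

# Minimal sketch: ask for the Hive catalog when the session is built.
spark = (SparkSession.builder
         .appName("my-app")  # placeholder name
         .config("spark.sql.catalogImplementation", "hive")
         .getOrCreate())

spark.sql("SHOW DATABASES").show()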


Kind Regards
Harsh

On Tue, May 26, 2020 at 9:55 PM Kun Huang (COSMOS)
 wrote:

>
> Hi Spark experts,
>
> I am seeking for an approach to enable hive support manually on an
> existing Spark session.
>
> Currently, HiveContext seems the best way for my scenario. However, this
> class has already been marked as deprecated and it is recommended to use
> SparkSession.builder.enableHiveSupport(). This should be called before
> creating Spark session.
>
> I wonder if there are other workaround?
>
> Thanks,
> Kun
>


Structured Streaming using Kafka Avro Record in 2.3.0

2020-04-28 Thread HARSH TAKKAR
Hi

How can we deserialise an Avro record read from Kafka in Spark 2.3.0 in an
optimised manner? I can see that native support for Avro was added in 2.4.x.

Currently I am using the following library, which is very slow:


<dependency>
    <groupId>com.twitter</groupId>
    <artifactId>bijection-avro_2.11</artifactId>
</dependency>



Kind Regards
Harsh Takkar


Reading 7z file in spark

2020-01-13 Thread HARSH TAKKAR
Hi,


Is it possible to read a 7z-compressed file in Spark?


Kind Regards
Harsh Takkar


Re: Hive External Table Partiton Data Type.

2019-12-16 Thread HARSH TAKKAR
Hi,


I am able to make it work in Spark 2.3.0. If you can change the version to
Spark 2.3, that would be good; otherwise let me know and I'll check on Spark
version 2.1.0.

Following is the code for spark 2.3.0.

scala> var seq = Seq((10L,"Hello"),(10L,"Hi"))
seq: Seq[(Long, String)] = List((10,Hello), (10,Hi))

scala> seq.toDF("a","b")
res0: org.apache.spark.sql.DataFrame = [a: bigint, b: string]

scala> res0.show()
+---+-----+
|  a|    b|
+---+-----+
| 10|Hello|
| 10|   Hi|
+---+-----+

scala> res0.printSchema
root
 |-- a: long (nullable = false)
 |-- b: string (nullable = true)


scala>
res0.write.option("path","/tmp/longpartition1").mode("overwrite").partitionBy("a").saveAsTable("default.longPartition")


hive> select * from longpartition
> ;
OK
Hello 10
Hi 10
Time taken: 0.356 seconds, Fetched: 2 row(s)
hive> describe longpartition;
OK
b   string
a   bigint

# Partition Information
# col_name data_type   comment

a   bigint

On Mon, Dec 16, 2019 at 11:05 AM SB M  wrote:

> spark version 2.1.0
>
> Regards,
> Sbm
>
> On Mon, 16 Dec, 2019, 10:04 HARSH TAKKAR,  wrote:
>
>> Please share the spark version you are using .
>>
>> On Fri, 13 Dec, 2019, 4:02 PM SB M,  wrote:
>>
>>> Hi All,
>>>Am trying to create a dynamic partition with external table on hive
>>> metastore using spark sql.
>>>
>>> when am trying to create a partition column data type as bigint,
>>> partition is not working even i tried with repair table. data is not shown
>>> when i ran sample query select * from table.
>>>
>>>
>>> but if i tried to create a dynamic partition with string as data type
>>> for partition its working fine. partition are working as expected.
>>>
>>>
>>> is there something am doing it wrong ??
>>>
>>>
>>> Regards,
>>> Sree
>>>
>>>


Re: Hive External Table Partiton Data Type.

2019-12-15 Thread HARSH TAKKAR
Please share the Spark version you are using.

On Fri, 13 Dec, 2019, 4:02 PM SB M,  wrote:

> Hi All,
>Am trying to create a dynamic partition with external table on hive
> metastore using spark sql.
>
> when am trying to create a partition column data type as bigint, partition
> is not working even i tried with repair table. data is not shown when i ran
> sample query select * from table.
>
>
> but if i tried to create a dynamic partition with string as data type for
> partition its working fine. partition are working as expected.
>
>
> is there something am doing it wrong ??
>
>
> Regards,
> Sree
>
>


Re: Unable to write data from Spark into a Hive Managed table

2019-08-20 Thread HARSH TAKKAR
Please refer to the following documentation on how to write data into Hive in
HDP 3.1:

https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.1.0/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
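
For reference, a rough PySpark sketch of the write path described in that
document; the DataFrame and table name are placeholders, and the connector jar
plus the HiveServer2/LLAP settings from the linked page must already be
configured:

# Rough sketch only; "df" and the table name are placeholders.
HWC_FORMAT = "com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector"

(df.write
   .format(HWC_FORMAT)
   .mode("append")
   .option("table", "mydb.my_managed_table")
   .save())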



Harsh

On Fri, 9 Aug, 2019, 10:21 PM Mich Talebzadeh, 
wrote:

> Check your permissioning.
>
> Can you do insert select from external table into Hive managed table
> created by spark?
>
> //
> // Need to create and populate target ORC table transactioncodes_ll in
> database accounts.in Hive
> //
> HiveContext.sql("use accounts")
> //
> // Drop and create table transactioncodes_ll
> //
> spark.sql("DROP TABLE IF EXISTS accounts.transactioncodes_ll")
> var sqltext = ""
> sqltext = """
> CREATE TABLE accounts.transactioncodes_ll (
> id Int
> ,transactiontype   String
> ,description   String
> )
> COMMENT 'from Hive external table'
> STORED AS ORC
> TBLPROPERTIES ( "orc.compress"="ZLIB" )
> """
> HiveContext.sql(sqltext)
> //
> // Put data in Hive table. Clean up is already done
> //
> sqltext = """
> INSERT INTO TABLE accounts.transactioncodes_ll
> SELECT
>   id
> , TRIM(transactiontype)
> , TRIM(description)
> FROM 
> """
> spark.sql(sqltext)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 9 Aug 2019 at 12:04, Debabrata Ghosh 
> wrote:
>
>> Hi ,
>>   I am using Hortonworks Data Platform 3.1. I am unable to
>> write data from Spark into a Hive Managed table but am able to do so in a
>> Hive External table.
>>
>>   Would you please help get me with a resolution.
>>
>> Thanks,
>> Debu
>>
>


Re: Back pressure not working on streaming

2019-01-01 Thread HARSH TAKKAR
There is a separate property for the max rate; by default it is not set, so if
you want to limit the max rate you should give that property a value.

Initial rate = 10 means it will pick only 10 records per receiver in the batch
interval when you start the process.

Depending upon the consumption rate, it will increase the consumption of
records for processing in each batch.

However, I feel 10 is way too low a number for a 32-partition Kafka topic.
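
For illustration, the related rate-control properties side by side (the values
are examples only, not recommendations):

from pyspark import SparkConf

# Illustrative values only.
conf = (SparkConf()
        .set("spark.streaming.backpressure.enabled", "true")
        .set("spark.streaming.backpressure.initialRate", "1000")    # rate for the first batch
        .set("spark.streaming.kafka.maxRatePerPartition", "5000"))  # hard cap per Kafka partition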



Regards
Harsh
Happy New Year

On Wed, 2 Jan 2019, 08:33 JF Chen wrote:

> I have set spark.streaming.backpressure.enabled to true,
> spark.streaming.backpressure.initialRate to 10.
> Once my application started, it received 32 million messages on first
> batch.
> My application runs every 300 seconds, with 32 kafka partition. So what's
> is the max rate if I set initial rate to 10?
>
> Thanks!
>
>
> Regard,
> Junfeng Chen
>


Re: executing stored procedure through spark

2018-08-13 Thread HARSH TAKKAR
Hi

You can call the Java program directly through PySpark.

Following is the code that will help.

sc._jvm..
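
For illustration, a rough, hypothetical sketch of reaching plain JDBC through
the driver's JVM gateway; the URL, credentials and procedure name below are
placeholders, and the JDBC driver jar must be on the driver classpath:

# Everything below is a placeholder, not from this thread.
jvm = spark.sparkContext._jvm

conn = jvm.java.sql.DriverManager.getConnection(
    "jdbc:mysql://dbhost:3306/mydb", "user", "password")
stmt = conn.prepareCall("{call my_stored_proc(?)}")
stmt.setInt(1, 42)
stmt.execute()
stmt.close()
conn.close()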


Harsh Takkar

On Sun, Aug 12, 2018 at 9:27 PM amit kumar singh 
wrote:

> Hi /team,
>
> The way we call java program to executed stored procedure
> is there any way we can achieve the same using pyspark
>
>


Re: Pyspark access to scala/java libraries

2018-07-18 Thread HARSH TAKKAR
Hi

You can access your java packages using following in pySpark

obj = sc._jvm.yourPackage.className()


Kind Regards
Harsh Takkar

On Wed, Jul 18, 2018 at 4:00 AM Mohit Jaggi  wrote:

> Thanks 0xF0F0F0 and Ashutosh for the pointers.
>
> Holden,
> I am trying to look into sparklingml...what am I looking for? Also which
> chapter/page of your book should I look at?
>
> Mohit.
>
> On Sun, Jul 15, 2018 at 3:02 AM Holden Karau 
> wrote:
>
>> If you want to see some examples in a library shows a way to do it -
>> https://github.com/sparklingpandas/sparklingml and high performance
>> spark also talks about it.
>>
>> On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote:
>>
>>> Check
>>> https://stackoverflow.com/questions/31684842/calling-java-scala-function-from-a-task
>>>
>>> ​Sent with ProtonMail Secure Email.​
>>>
>>> ‐‐‐ Original Message ‐‐‐
>>>
>>> On July 15, 2018 8:01 AM, Mohit Jaggi  wrote:
>>>
>>> > Trying again…anyone know how to make this work?
>>> >
>>> > > On Jul 9, 2018, at 3:45 PM, Mohit Jaggi mohitja...@gmail.com wrote:
>>> > >
>>> > > Folks,
>>> > >
>>> > > I am writing some Scala/Java code and want it to be usable from
>>> pyspark.
>>> > >
>>> > > For example:
>>> > >
>>> > > class MyStuff(addend: Int) {
>>> > >
>>> > > def myMapFunction(x: Int) = x + addend
>>> > >
>>> > > }
>>> > >
>>> > > I want to call it from pyspark as:
>>> > >
>>> > > df = ...
>>> > >
>>> > > mystuff = sc._jvm.MyStuff(5)
>>> > >
>>> > > df[‘x’].map(lambda x: mystuff.myMapFunction(x))
>>> > >
>>> > > How can I do this?
>>> > >
>>> > > Mohit.
>>> >
>>> > --
>>> >
>>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>


Sklearn model in pyspark prediction

2018-05-15 Thread HARSH TAKKAR
Hi,

Is there a way to load a model saved using the sklearn library in PySpark /
Scala Spark for prediction?
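
One commonly suggested approach is to load the model on the driver, broadcast
it, and call it from a UDF. A minimal sketch, assuming an existing SparkSession
"spark", a DataFrame "df" with a single numeric feature column "x", a model
saved with joblib, and scikit-learn installed on every executor (all of these
names are placeholders):

import joblib
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

model = joblib.load("/path/to/model.pkl")          # load once on the driver
bc_model = spark.sparkContext.broadcast(model)     # ship it to the executors

def _predict(x):
    # The model expects a 2-D feature array; adapt for multi-feature input.
    return float(bc_model.value.predict([[x]])[0])

predict_udf = udf(_predict, DoubleType())
scored = df.withColumn("prediction", predict_udf(df["x"]))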


Thanks


Data of ArrayType field getting truncated when saving to parquet

2018-01-31 Thread HARSH TAKKAR
Hi

I have a dataframe with a field of type array which is of large size. When I
try to save the data to a Parquet file and read it again, the array field comes
out as an empty array.

Please help


Harsh


Does Random Forest in spark ML supports multi label classification in scala

2017-11-07 Thread HARSH TAKKAR
Hi

Does Random Forest in Spark ML support multi-label classification in Scala?

I found that sklearn provides sklearn.ensemble.RandomForestClassifier in
Python; do we have similar functionality in Scala?


Building Spark with hive 1.1.0

2017-11-06 Thread HARSH TAKKAR
Hi

I am using the Cloudera (CDH 5.11.0) setup, which has Hive version 1.1.0, but
when I build Spark with Hive and Thrift support it packages Hive version 1.6.0.

Please let me know how I can build Spark with Hive 1.1.0.

Command I am using to build:
./dev/make-distribution.sh --name mySpark --tgz -Phadoop-2.6 -Phive
-Phive-thriftserver -Pyarn


Error with hive 1.6.0 on cdh5.11.0
Required field 'client_protocol' is unset!
Struct:TOpenSessionReq(client_protocol:null,
configuration:{use:database=default})


Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-19 Thread HARSH TAKKAR
Thanks Cody,

It worked for me by keeping the number of executors, each with 1 core, equal
to the number of Kafka partitions.
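
For illustration, the corresponding settings (32 is just an example matching a
32-partition topic):

from pyspark import SparkConf

# Example values: one core per executor, one executor per Kafka partition.
conf = (SparkConf()
        .set("spark.executor.instances", "32")
        .set("spark.executor.cores", "1"))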



On Mon, Sep 18, 2017 at 8:47 PM Cody Koeninger <c...@koeninger.org> wrote:

> Have you searched in jira, e.g.
>
> https://issues.apache.org/jira/browse/SPARK-19185
>
> On Mon, Sep 18, 2017 at 1:56 AM, HARSH TAKKAR <takkarha...@gmail.com>
> wrote:
> > Hi
> >
> > Changing the Spark version is my last resort; is there any other workaround
> > for this problem.
> >
> >
> > On Mon, Sep 18, 2017 at 11:43 AM pandees waran <pande...@gmail.com>
> wrote:
> >>
> >> All, May I know what exactly changed in 2.1.1 which solved this problem?
> >>
> >> Sent from my iPhone
> >>
> >> On Sep 17, 2017, at 11:08 PM, Anastasios Zouzias <zouz...@gmail.com>
> >> wrote:
> >>
> >> Hi,
> >>
> >> I had a similar issue using 2.1.0 but not with Kafka. Updating to 2.1.1
> >> solved my issue. Can you try with 2.1.1 as well and report back?
> >>
> >> Best,
> >> Anastasios
> >>
> >> Am 17.09.2017 16:48 schrieb "HARSH TAKKAR" <takkarha...@gmail.com>:
> >>
> >>
> >> Hi
> >>
> >> I am using spark 2.1.0 with scala  2.11.8, and while iterating over the
> >> partitions of each rdd in a dStream formed using KafkaUtils, i am
> getting
> >> the below exception, please suggest a fix.
> >>
> >> I have following config
> >>
> >> kafka :
> >> enable.auto.commit:"true",
> >> auto.commit.interval.ms:"1000",
> >> session.timeout.ms:"3",
> >>
> >> Spark:
> >>
> >> spark.streaming.backpressure.enabled=true
> >>
> >> spark.streaming.kafka.maxRatePerPartition=200
> >>
> >>
> >> Exception in task 0.2 in stage 3236.0 (TID 77795)
> >> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
> >> multi-threaded access
> >>
> >> --
> >> Kind Regards
> >> Harsh
> >>
> >>
> >
>


Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-18 Thread HARSH TAKKAR
Hi

Changing the Spark version is my last resort; is there any other workaround for
this problem?


On Mon, Sep 18, 2017 at 11:43 AM pandees waran <pande...@gmail.com> wrote:

> All, May I know what exactly changed in 2.1.1 which solved this problem?
>
> Sent from my iPhone
>
> On Sep 17, 2017, at 11:08 PM, Anastasios Zouzias <zouz...@gmail.com>
> wrote:
>
> Hi,
>
> I had a similar issue using 2.1.0 but not with Kafka. Updating to 2.1.1
> solved my issue. Can you try with 2.1.1 as well and report back?
>
> Best,
> Anastasios
>
> Am 17.09.2017 16:48 schrieb "HARSH TAKKAR" <takkarha...@gmail.com>:
>
>
> Hi
>
> I am using spark 2.1.0 with scala  2.11.8, and while iterating over the
> partitions of each rdd in a dStream formed using KafkaUtils, i am getting
> the below exception, please suggest a fix.
>
> I have following config
>
> kafka :
> enable.auto.commit:"true",
> auto.commit.interval.ms:"1000",
> session.timeout.ms:"3",
>
> Spark:
>
> spark.streaming.backpressure.enabled=true
>
> spark.streaming.kafka.maxRatePerPartition=200
>
>
> Exception in task 0.2 in stage 3236.0 (TID 77795)
> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
> multi-threaded access
>
> --
> Kind Regards
> Harsh
>
>
>


Re: ConcurrentModificationException using Kafka Direct Stream

2017-09-17 Thread HARSH TAKKAR
Hi,

No, we are not creating any thread for the Kafka DStream; however, we have a
single thread for refreshing a resource cache on the driver, but that is
totally separate from this connection.

On Mon, Sep 18, 2017 at 12:29 AM kant kodali <kanth...@gmail.com> wrote:

> Are you creating threads in your application?
>
> On Sun, Sep 17, 2017 at 7:48 AM, HARSH TAKKAR <takkarha...@gmail.com>
> wrote:
>
>>
>> Hi
>>
>> I am using spark 2.1.0 with scala  2.11.8, and while iterating over the
>> partitions of each rdd in a dStream formed using KafkaUtils, i am getting
>> the below exception, please suggest a fix.
>>
>> I have following config
>>
>> kafka :
>> enable.auto.commit:"true",
>> auto.commit.interval.ms:"1000",
>> session.timeout.ms:"3",
>>
>> Spark:
>>
>> spark.streaming.backpressure.enabled=true
>>
>> spark.streaming.kafka.maxRatePerPartition=200
>>
>>
>> Exception in task 0.2 in stage 3236.0 (TID 77795)
>> java.util.ConcurrentModificationException: KafkaConsumer is not safe for
>> multi-threaded access
>>
>> --
>> Kind Regards
>> Harsh
>>
>
>


ConcurrentModificationException using Kafka Direct Stream

2017-09-17 Thread HARSH TAKKAR
Hi

I am using Spark 2.1.0 with Scala 2.11.8, and while iterating over the
partitions of each RDD in a DStream formed using KafkaUtils, I am getting the
below exception. Please suggest a fix.

I have following config

kafka :
enable.auto.commit:"true",
auto.commit.interval.ms:"1000",
session.timeout.ms:"3",

Spark:

spark.streaming.backpressure.enabled=true

spark.streaming.kafka.maxRatePerPartition=200


Exception in task 0.2 in stage 3236.0 (TID 77795)
java.util.ConcurrentModificationException: KafkaConsumer is not safe for
multi-threaded access

--
Kind Regards
Harsh


update hive metastore in spark session at runtime

2017-09-01 Thread HARSH TAKKAR
Hi,

I have just started using SparkSession with Hive enabled, but I am facing an
issue while updating the Hive warehouse directory after the Spark session has
been created.

Use case: I want to read data from Hive on one cluster and write to Hive on
another cluster.

Please suggest if this can be done?


Reading parquet file in stream

2017-08-16 Thread HARSH TAKKAR
Hi

I want to read an HDFS directory which contains Parquet files. How can I
stream data from this directory using the streaming context (ssc.fileStream)?
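
For reference, a sketch of doing this with the Structured Streaming file source
instead of ssc.fileStream; the schema, input path and checkpoint location are
placeholders, and file sources require an explicit schema:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("id", LongType()), StructField("name", StringType())])

stream_df = (spark.readStream
             .schema(schema)                       # required for file sources
             .parquet("hdfs:///data/incoming/"))

query = (stream_df.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/parquet-stream-ckpt")
         .start())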


Harsh


Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread HARSH TAKKAR
Hi

I can see that the exception is caused by the following; can you check where
in your code you are using this path:

Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does
not exist:
hdfs://testcluster:8020/experiments/vol/spark_chomp_data/bak/restaurants-bak/latest

On Wed, 17 Aug 2016, 10:57 p.m. max square,  wrote:

> /bump
>
> It'd be great if someone can point me to the correct direction.
>
> On Mon, Aug 8, 2016 at 5:07 PM, max square 
> wrote:
>
>> Here's the complete stacktrace -
>> https://gist.github.com/rohann/649b0fcc9d5062ef792eddebf5a315c1
>>
>> For reference, here's the complete function -
>>
>>   def saveAsLatest(df: DataFrame, fileSystem: FileSystem, bakDir:
>> String) = {
>>
>> fileSystem.rename(new Path(bakDir + latest), new Path(bakDir + "/" +
>> ScalaUtil.currentDateTimeString))
>>
>> df.write.format("com.databricks.spark.csv").options(Map("mode" ->
>> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "true")).save(bakDir +
>> latest)
>>
>>   }
>>
>> On Mon, Aug 8, 2016 at 3:41 PM, Ted Yu  wrote:
>>
>>> Mind showing the complete stack trace ?
>>>
>>> Thanks
>>>
>>> On Mon, Aug 8, 2016 at 12:30 PM, max square 
>>> wrote:
>>>
 Thanks Ted for the prompt reply.

 There are three or four DFs that are coming from various sources and
 I'm doing a unionAll on them.

 val placesProcessed = placesUnchanged.unionAll(
 placesAddedWithMerchantId).unionAll(
 placesUpdatedFromHotelsWithMerchantId).unionAll(
 placesUpdatedFromRestaurantsWithMerchantId).unionAll(placesChanged)

 I'm using Spark 1.6.2.

 On Mon, Aug 8, 2016 at 3:11 PM, Ted Yu  wrote:

> Can you show the code snippet for unionAll operation ?
>
> Which Spark release do you use ?
>
> BTW please use user@spark.apache.org in the future.
>
> On Mon, Aug 8, 2016 at 11:47 AM, max square 
> wrote:
>
>> Hey guys,
>>
>> I'm trying to save Dataframe in CSV format after performing unionAll
>> operations on it.
>> But I get this exception -
>>
>> Exception in thread "main"
>> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute,
>> tree:
>> TungstenExchange hashpartitioning(mId#430,200)
>>
>> I'm saving it by
>>
>> df.write.format("com.databricks.spark.csv").options(Map("mode" ->
>> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "true")).save(bakDir
>> + latest)
>>
>> It works perfectly if I don't do the unionAll operation.
>> I see that the format isn't different by printing the part of the
>> results.
>>
>> Any help regarding this would be appreciated.
>>
>>
>

>>>
>>
>


Re: Updating Values Inside Foreach Rdd loop

2016-05-09 Thread HARSH TAKKAR
Hi

Please help.

On Sat, 7 May 2016, 11:43 p.m. HARSH TAKKAR, <takkarha...@gmail.com> wrote:

> Hi Ted
>
> Following is my use case.
>
> I have a prediction algorithm where i need to update some records to
> predict the target.
>
> For eg.
> I have an eq. Y=  mX +c
> I need to change value of Xi of some records and calculate sum(Yi) if the
> value of prediction is not close to target value then repeat the process.
>
> In each iteration different set of values are updated but result is
> checked when we sum up the values.
>
> On Sat, 7 May 2016, 8:58 a.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>
>> Using RDDs requires some 'low level' optimization techniques.
>> While using dataframes / Spark SQL allows you to leverage existing code.
>>
>> If you can share some more of your use case, that would help other people
>> provide suggestions.
>>
>> Thanks
>>
>> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>>
>> Hi Ted
>>
>> I am aware that rdd are immutable, but in my use case i need to update
>> same data set after each iteration.
>>
>> Following are the points which i was exploring.
>>
>> 1. Generating rdd in each iteration.( It might use a lot of memory).
>>
>> 2. Using Hive tables and update the same table after each iteration.
>>
>> Please suggest,which one of the methods listed above will be good to use
>> , or is there are more better ways to accomplish it.
>>
>> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>>
>>> Please see the doc at the beginning of RDD class:
>>>
>>>  * A Resilient Distributed Dataset (RDD), the basic abstraction in
>>> Spark. Represents an immutable,
>>>  * partitioned collection of elements that can be operated on in
>>> parallel. This class contains the
>>>  * basic operations available on all RDDs, such as `map`, `filter`, and
>>> `persist`. In addition,
>>>
>>> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com>
>>> wrote:
>>>
>>>> Hi
>>>>
>>>> Is there a way i can modify a RDD, in for-each loop,
>>>>
>>>> Basically, i have a use case in which i need to perform multiple
>>>> iteration over data and modify few values in each iteration.
>>>>
>>>>
>>>> Please help.
>>>>
>>>
>>>


Re: Updating Values Inside Foreach Rdd loop

2016-05-07 Thread HARSH TAKKAR
Hi Ted

Following is my use case.

I have a prediction algorithm where I need to update some records to predict
the target.

For example, I have an equation Y = mX + c. I need to change the value of Xi
for some records and calculate sum(Yi); if the prediction is not close to the
target value, the process is repeated.

In each iteration a different set of values is updated, but the result is
checked when we sum up the values.
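
For illustration, a rough sketch of the "rebuild the dataset each iteration"
approach for this kind of loop; m, c, the target, the data and the update rule
below are all made up for the example:

sc = spark.sparkContext
m, c, target, tolerance = 2.0, 1.0, 100.0, 1e-3

rdd = sc.parallelize([(i, float(i)) for i in range(10)])   # (id, x) pairs

for _ in range(50):
    y_sum = rdd.map(lambda rec: m * rec[1] + c).sum()
    if abs(y_sum - target) < tolerance:
        break
    # RDDs are immutable, so each pass builds a new RDD with adjusted x values.
    rdd = rdd.map(lambda rec: (rec[0], rec[1] * 0.9) if rec[0] % 2 == 0 else rec).cache()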

On Sat, 7 May 2016, 8:58 a.m. Ted Yu, <yuzhih...@gmail.com> wrote:

> Using RDDs requires some 'low level' optimization techniques.
> While using dataframes / Spark SQL allows you to leverage existing code.
>
> If you can share some more of your use case, that would help other people
> provide suggestions.
>
> Thanks
>
> On May 6, 2016, at 6:57 PM, HARSH TAKKAR <takkarha...@gmail.com> wrote:
>
> Hi Ted
>
> I am aware that rdd are immutable, but in my use case i need to update
> same data set after each iteration.
>
> Following are the points which i was exploring.
>
> 1. Generating rdd in each iteration.( It might use a lot of memory).
>
> 2. Using Hive tables and update the same table after each iteration.
>
> Please suggest,which one of the methods listed above will be good to use ,
> or is there are more better ways to accomplish it.
>
> On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:
>
>> Please see the doc at the beginning of RDD class:
>>
>>  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
>> Represents an immutable,
>>  * partitioned collection of elements that can be operated on in
>> parallel. This class contains the
>>  * basic operations available on all RDDs, such as `map`, `filter`, and
>> `persist`. In addition,
>>
>> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> Is there a way i can modify a RDD, in for-each loop,
>>>
>>> Basically, i have a use case in which i need to perform multiple
>>> iteration over data and modify few values in each iteration.
>>>
>>>
>>> Please help.
>>>
>>
>>


Re: Updating Values Inside Foreach Rdd loop

2016-05-06 Thread HARSH TAKKAR
Hi Ted

I am aware that RDDs are immutable, but in my use case I need to update the
same data set after each iteration.

Following are the approaches I was exploring:

1. Generating a new RDD in each iteration (it might use a lot of memory).

2. Using Hive tables and updating the same table after each iteration.

Please suggest which of the methods listed above would be good to use, or
whether there are better ways to accomplish this.

On Fri, 6 May 2016, 7:09 p.m. Ted Yu, <yuzhih...@gmail.com> wrote:

> Please see the doc at the beginning of RDD class:
>
>  * A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
> Represents an immutable,
>  * partitioned collection of elements that can be operated on in parallel.
> This class contains the
>  * basic operations available on all RDDs, such as `map`, `filter`, and
> `persist`. In addition,
>
> On Fri, May 6, 2016 at 5:25 AM, HARSH TAKKAR <takkarha...@gmail.com>
> wrote:
>
>> Hi
>>
>> Is there a way i can modify a RDD, in for-each loop,
>>
>> Basically, i have a use case in which i need to perform multiple
>> iteration over data and modify few values in each iteration.
>>
>>
>> Please help.
>>
>
>


Updating Values Inside Foreach Rdd loop

2016-05-06 Thread HARSH TAKKAR
Hi

Is there a way I can modify an RDD in a for-each loop?

Basically, I have a use case in which I need to perform multiple iterations
over the data and modify a few values in each iteration.


Please help.


Re: [Please Help] Log redirection on EMR

2016-02-23 Thread HARSH TAKKAR
Hi Sabarish

Thanks for your help; I was able to get the logs from the archive. Is there a
way I can adjust the archival policy, say to persist the logs of the last two
jobs on the resource manager and archive the others on the file system?


On Mon, Feb 22, 2016 at 12:25 PM Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Your logs are getting archived in your logs bucket in S3.
>
>
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-debugging.html
>
> Regards
> Sab
>
> On Mon, Feb 22, 2016 at 12:14 PM, HARSH TAKKAR <takkarha...@gmail.com>
> wrote:
>
>> Hi
>>
>> In am using an EMR cluster  for running my spark jobs, but after the job
>> finishes logs disappear,
>>
>> I have added a log4j.properties in my jar, but all the logs still
>> redirects to EMR resource manager which vanishes after jobs completes, is
>> there a way i could redirect the logs to a location in file  syatem, I am
>> working on price points and its very critical for me to maintain logs.
>>
>> Just to add i get following error when my application starts.
>>
>> java.io.FileNotFoundException: /etc/spark/conf/log4j.properties (No such 
>> file or directory)
>>  at java.io.FileInputStream.open(Native Method)
>>  at java.io.FileInputStream.(FileInputStream.java:146)
>>  at java.io.FileInputStream.(FileInputStream.java:101)
>>  at 
>> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
>>  at 
>> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
>>  at 
>> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
>>  at 
>> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
>>  at org.apache.log4j.LogManager.(LogManager.java:127)
>>  at org.apache.spark.Logging$class.initializeLogging(Logging.scala:122)
>>  at 
>> org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:107)
>>  at org.apache.spark.Logging$class.log(Logging.scala:51)
>>  at 
>> org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:607)
>>  at 
>> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:621)
>>  at 
>> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>
>>
>>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>


[Please Help] Log redirection on EMR

2016-02-21 Thread HARSH TAKKAR
Hi

I am using an EMR cluster for running my Spark jobs, but after a job finishes
the logs disappear.

I have added a log4j.properties in my jar, but all the logs are still
redirected to the EMR resource manager, where they vanish after the job
completes. Is there a way I could redirect the logs to a location on the file
system? I am working on price points and it is very critical for me to
maintain logs.

Just to add, I get the following error when my application starts:

java.io.FileNotFoundException: /etc/spark/conf/log4j.properties (No
such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at 
sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at 
sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.spark.Logging$class.initializeLogging(Logging.scala:122)
at 
org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:107)
at org.apache.spark.Logging$class.log(Logging.scala:51)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.log(ApplicationMaster.scala:607)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:621)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)


Using Java spring injection with spark

2016-02-01 Thread HARSH TAKKAR
>
> Hi
>
> I am new to Apache Spark and big data analytics. Before starting to code on
> Spark data frames and RDDs, I just wanted to confirm the following:
>
> 1. Can we create an implementation of java.api.Function as a singleton
> bean using the Spring framework, and can it be injected using autowiring
> into other classes?
>
> 2. What is the best way to submit jobs to Spark: using the API or using
> the shell script?
>
> Looking forward to your help,
>
>
> Kind Regards
> Harsh
>


Re: Using Java spring injection with spark

2016-02-01 Thread HARSH TAKKAR
Hi

Please, can anyone reply to this?

On Mon, 1 Feb 2016, 4:28 p.m. HARSH TAKKAR <takkarha...@gmail.com> wrote:

> Hi
>>
>> I am new to apache spark and big data analytics, before starting to code
>> on spark data frames and rdd, i just wanted to confirm following
>>
>> 1. Can we create an implementation of java.api.Function as a singleton
>> bean using the spring frameworks and, can it be injected using autowiring
>> to other classes.
>>
>> 2. what is the best way to submit jobs to spark , using the api or using
>> the shell script?
>>
>> Looking forward for your help,
>>
>>
>> Kind Regards
>> Harsh
>>
>


Re: Using Java spring injection with spark

2016-02-01 Thread HARSH TAKKAR
Hi Sambit

My app is basically a cron job which checks the DB for a job that is scheduled
and needs to be executed, and it submits the job to Spark using the Spark Java
API. This app is written with the Spring framework at its core.

Each job has a set of tasks which need to be executed in order. We have
implemented a chain-of-responsibility pattern to do so, and we persist the
status of each snapshot step in MySQL after persisting the step result in a
CSV file, so that if the job breaks in between we know which file to pick up
and continue processing from.

I basically wanted to inject the snapshot steps with the JDBC layer and other
dependencies through Spring auto-wiring.

Will using Spring in this use case adversely affect performance, and, as you
mentioned, will it cause serialization errors?


On Tue, Feb 2, 2016 at 1:16 AM Sambit Tripathy (RBEI/EDS1) <
sambit.tripa...@in.bosch.com> wrote:

>
>
> 1.  It depends on what you want to do. Don’t worry about singleton
> and wiring the beans as it is pretty much taken care by the Spark Framework
> itself. Infact doing so, you will run into issues like serialization errors.
>
>
>
> 2.  You can write your code using Scala/ Python using the spark shell
> or a notebook like Ipython, zeppelin  or if you have written a application
> using Scala/Java using the Spark API you can create a jar and run it using
> spark-submit.
>
> *From:* HARSH TAKKAR [mailto:takkarha...@gmail.com]
> *Sent:* Monday, February 01, 2016 10:00 AM
> *To:* user@spark.apache.org
> *Subject:* Re: Using Java spring injection with spark
>
>
>
> Hi
>
> Please can anyone reply on this.
>
>
>
> On Mon, 1 Feb 2016, 4:28 p.m. HARSH TAKKAR <takkarha...@gmail.com> wrote:
>
> Hi
>
> I am new to apache spark and big data analytics, before starting to code
> on spark data frames and rdd, i just wanted to confirm following
>
> 1. Can we create an implementation of java.api.Function as a singleton
> bean using the spring frameworks and, can it be injected using autowiring
> to other classes.
>
> 2. what is the best way to submit jobs to spark , using the api or using
> the shell script?
>
> Looking forward for your help,
>
> Kind Regards
>
> Harsh
>
>