Re: ivy unit test case failing for Spark

2021-12-21 Thread Wes Peng
Are you using IvyVPN, which causes this problem? If the VPN software silently
rewrites network URLs, you should avoid using it.

Regards.

On Wed, Dec 22, 2021 at 1:48 AM Pralabh Kumar wrote:

> Hi Spark Team
>
> I am building Spark inside a VPN, but the unit test case below is failing.
> It points to an Ivy location that cannot be reached from within the VPN. Any
> help would be appreciated.
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true")
> {
>   *sc *= new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[3,
> 1, 1024]"))
>   *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
>   assert(*sc*.listJars().exists(_.contains(
> "org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(*sc*.listJars().exists(_.contains(
> "commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED ***
> java.lang.RuntimeException: [unresolved dependency:
>     org.apache.hive#hive-storage-api;2.7.0: not found]
>   at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>


Re: INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-21 Thread Sean Owen
16,000 joins are never going to work, though you could do them all at once
and avoid the immediate issue. If the files really do contain the same rows
in the same order, you could read them as lines of text and use zip().

On Tue, Dec 21, 2021, 8:48 AM Andrew Davidson wrote:

> Hi Jun
>
> Thank you for your reply. My question is: what is best practice? My for
> loop runs over 16,000 joins, and I get an out-of-memory exception.
>
> What is the intended use of createOrReplaceTempView if I need to manage
> the cache or create a unique name each time?
>
>
>
> Kind regards
>
> Andy
>
> On Tue, Dec 21, 2021 at 6:12 AM Jun Zhu 
> wrote:
>
>> Hi
>>
>> As far as I know, the warning is caused by creating the same temp view
>> name twice: rawCountsSDF.createOrReplaceTempView( "rawCounts" ).
>> You create a view "rawCounts"; then, on the second round of the for loop,
>> you create a new view named "rawCounts", and Spark 3 uncaches the
>> previous "rawCounts".
>>
>> Correct me if I'm wrong.
>>
>> Regards
>>
>>
>> On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson wrote:
>>
>>> Happy Holidays
>>>
>>>
>>>
>>> I am a newbie
>>>
>>>
>>>
>>> I have 16,000 data files; all files have the same number of rows and
>>> columns. The row ids are identical and are in the same order. I want to
>>> create a new data frame that contains the 3rd column from each data file.
>>> My pyspark script runs correctly when I test on a small number of files;
>>> however, I get an OOM when I run on all 16,000.
>>>
>>>
>>>
>>> To try to debug, I ran a small test and set the logging level to INFO. I
>>> found the following:
>>>
>>>
>>>
>>> 2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache
>>> `rawCounts` before replacing.
>>>
>>>
>>>
>>> for i in range( 1, len( self.sampleNamesList ) ):
>>>     sampleName = self.sampleNamesList[i]
>>>
>>>     # select the key and counts from the sample.
>>>     qsdf = quantSparkDFList[i]
>>>     sampleSDF = qsdf\
>>>         .select( ["Name", "NumReads", ] )\
>>>         .withColumnRenamed( "NumReads", sampleName )
>>>
>>>     sampleSDF.createOrReplaceTempView( "sample" )
>>>
>>>     # the sample name must be quoted, else column names with a '-'
>>>     # like GTEX-1117F-0426-SM-5EGHI will generate an error:
>>>     # spark thinks the '-' is an expression. '_' is also
>>>     # a special char for the sql like operator
>>>     # https://stackoverflow.com/a/63899306/4586180
>>>     sqlStmt = 'select rc.*, `{}` \n\
>>>                from rawCounts as rc, sample \n\
>>>                where rc.Name == sample.Name \n'.format( sampleName )
>>>
>>>     rawCountsSDF = self.spark.sql( sqlStmt )
>>>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>>>
>>>
>>>
>>>
>>>
>>> The way I wrote my script, I do a lot of transformations; the first
>>> action is at the end of the script:
>>>
>>> retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite',
>>>     header=True )
>>>
>>>
>>>
>>> Should I be calling self.spark.sql( 'uncache table rawCounts' ) before
>>> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I expected
>>> Spark to manage the cache automatically, given that I do not explicitly
>>> call cache().
>>>
>>>
>>>
>>>
>>>
>>> Why do I not get a similar warning from the following?
>>>
>>> sampleSDF.createOrReplaceTempView( "sample" )
>>>
>>>
>>>
>>> Will this reduce my memory requirements?
>>>
>>>
>>>
>>> Kind regards
>>>
>>>
>>>
>>> Andy
>>>
>>
>>
>> --
>> Jun Zhu
>> Sr. Engineer I, Data
>> +86 18565739171
>> Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
>>
>>


Re: ivy unit test case failing for Spark

2021-12-21 Thread Sean Owen
You would have to make it available; this doesn't seem like a Spark issue.
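
One way to make the artifact available is to point Ivy resolution at a
repository that is reachable from inside the VPN. A minimal pyspark sketch,
assuming an internal Maven mirror exists; its URL below is hypothetical:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("test")
        .setMaster("local[2]")
        # Hypothetical internal mirror reachable from inside the VPN.
        # spark.jars.repositories adds extra remote repositories that are
        # searched when resolving ivy:// or --packages coordinates.
        .set("spark.jars.repositories",
             "https://artifactory.internal.example/maven"))
sc = SparkContext(conf=conf)
sc.addJar("ivy://org.apache.hive:hive-storage-api:2.7.0")

Alternatively, spark.jars.ivySettings can point at a custom ivysettings.xml
that routes all resolution through the internal repository.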

On Tue, Dec 21, 2021, 10:48 AM Pralabh Kumar  wrote:

> Hi Spark Team
>
> I am building Spark inside a VPN, but the unit test case below is failing.
> It points to an Ivy location that cannot be reached from within the VPN. Any
> help would be appreciated.
>
> test("SPARK-33084: Add jar support Ivy URI -- default transitive = true")
> {
>   *sc *= new SparkContext(new 
> SparkConf().setAppName("test").setMaster("local-cluster[3,
> 1, 1024]"))
>   *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
>   assert(*sc*.listJars().exists(_.contains(
> "org.apache.hive_hive-storage-api-2.7.0.jar")))
>   assert(*sc*.listJars().exists(_.contains(
> "commons-lang_commons-lang-2.6.jar")))
> }
>
> Error
>
> - SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED ***
> java.lang.RuntimeException: [unresolved dependency:
>     org.apache.hive#hive-storage-api;2.7.0: not found]
>   at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
>   at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
>   at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>
> Regards
> Pralabh Kumar
>
>
>


ivy unit test case failing for Spark

2021-12-21 Thread Pralabh Kumar
Hi Spark Team

I am building Spark inside a VPN, but the unit test case below is failing.
It points to an Ivy location that cannot be reached from within the VPN. Any
help would be appreciated.

test("SPARK-33084: Add jar support Ivy URI -- default transitive = true") {
  *sc *= new SparkContext(new
SparkConf().setAppName("test").setMaster("local-cluster[3,
1, 1024]"))
  *sc*.addJar("*ivy://org.apache.hive:hive-storage-api:2.7.0*")
  assert(*sc*.listJars().exists(_.contains(
"org.apache.hive_hive-storage-api-2.7.0.jar")))
  assert(*sc*.listJars().exists(_.contains(
"commons-lang_commons-lang-2.6.jar")))
}

Error

- SPARK-33084: Add jar support Ivy URI -- default transitive = true *** FAILED ***
java.lang.RuntimeException: [unresolved dependency:
    org.apache.hive#hive-storage-api;2.7.0: not found]
  at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1447)
  at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:185)
  at org.apache.spark.util.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:159)
  at org.apache.spark.SparkContext.addJar(SparkContext.scala:1996)
  at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
  at org.apache.spark.SparkContextSuite.$anonfun$new$115(SparkContextSuite.scala:1041)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

Regards
Pralabh Kumar


Re: INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-21 Thread Andrew Davidson
Hi Jun

Thank you for your reply. My question is: what is best practice? My for
loop runs over 16,000 joins, and I get an out-of-memory exception.

What is the intended use of createOrReplaceTempView if I need to manage the
cache or create a unique name each time?
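
For reference, a minimal sketch of managing the view explicitly with
spark.catalog.dropTempView, assuming the loop structure from the quoted
script; whether this helps with memory depends on whether any data is
actually cached:

# Drop the previous view definition (and uncache any data backing it)
# before the name is reused, instead of relying on the implicit uncache
# that produces the INFO message.
self.spark.catalog.dropTempView( "rawCounts" )
rawCountsSDF.createOrReplaceTempView( "rawCounts" )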



Kind regards

Andy

On Tue, Dec 21, 2021 at 6:12 AM Jun Zhu  wrote:

> Hi
>
> As far as I know, the warning is caused by creating the same temp view
> name twice: rawCountsSDF.createOrReplaceTempView( "rawCounts" ).
> You create a view "rawCounts"; then, on the second round of the for loop,
> you create a new view named "rawCounts", and Spark 3 uncaches the
> previous "rawCounts".
>
> Correct me if I'm wrong.
>
> Regards
>
>
> On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson wrote:
>
>> Happy Holidays
>>
>>
>>
>> I am a newbie
>>
>>
>>
>> I have 16,000 data files; all files have the same number of rows and
>> columns. The row ids are identical and are in the same order. I want to
>> create a new data frame that contains the 3rd column from each data file.
>> My pyspark script runs correctly when I test on a small number of files;
>> however, I get an OOM when I run on all 16,000.
>>
>>
>>
>> To try to debug, I ran a small test and set the logging level to INFO. I
>> found the following:
>>
>>
>>
>> 2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache
>> `rawCounts` before replacing.
>>
>>
>>
>> for i in range( 1, len( self.sampleNamesList ) ):
>>     sampleName = self.sampleNamesList[i]
>>
>>     # select the key and counts from the sample.
>>     qsdf = quantSparkDFList[i]
>>     sampleSDF = qsdf\
>>         .select( ["Name", "NumReads", ] )\
>>         .withColumnRenamed( "NumReads", sampleName )
>>
>>     sampleSDF.createOrReplaceTempView( "sample" )
>>
>>     # the sample name must be quoted, else column names with a '-'
>>     # like GTEX-1117F-0426-SM-5EGHI will generate an error:
>>     # spark thinks the '-' is an expression. '_' is also
>>     # a special char for the sql like operator
>>     # https://stackoverflow.com/a/63899306/4586180
>>     sqlStmt = 'select rc.*, `{}` \n\
>>                from rawCounts as rc, sample \n\
>>                where rc.Name == sample.Name \n'.format( sampleName )
>>
>>     rawCountsSDF = self.spark.sql( sqlStmt )
>>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>>
>>
>>
>>
>>
>> The way I wrote my script, I do a lot of transformations; the first
>> action is at the end of the script:
>>
>> retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite',
>>     header=True )
>>
>>
>>
>> Should I be calling self.spark.sql( 'uncache table rawCounts' ) before
>> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I expected
>> Spark to manage the cache automatically, given that I do not explicitly
>> call cache().
>>
>>
>>
>>
>>
>> Why do I not get a similar warning from the following?
>>
>> sampleSDF.createOrReplaceTempView( "sample" )
>>
>>
>>
>> Will this reduce my memory requirements?
>>
>>
>>
>> Kind regards
>>
>>
>>
>> Andy
>>
>
>
> --
> Jun Zhu
> Sr. Engineer I, Data
> +86 18565739171
> Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China
>
>


Re: INFO CreateViewCommand:57 - Try to uncache `rawCounts` before replacing.

2021-12-21 Thread Jun Zhu
Hi

As far as I know, the warning is caused by creating the same temp view
name twice: rawCountsSDF.createOrReplaceTempView( "rawCounts" ).
You create a view "rawCounts"; then, on the second round of the for loop,
you create a new view named "rawCounts", and Spark 3 uncaches the
previous "rawCounts".

Correct me if I'm wrong.

Regards


On Tue, Dec 21, 2021 at 10:05 PM Andrew Davidson wrote:

> Happy Holidays
>
>
>
> I am a newbie
>
>
>
> I have 16,000 data files; all files have the same number of rows and
> columns. The row ids are identical and are in the same order. I want to
> create a new data frame that contains the 3rd column from each data file.
> My pyspark script runs correctly when I test on a small number of files;
> however, I get an OOM when I run on all 16,000.
>
>
>
> To try to debug, I ran a small test and set the logging level to INFO. I
> found the following:
>
>
>
> 2021-12-21 00:47:04 INFO  CreateViewCommand:57 - Try to uncache
> `rawCounts` before replacing.
>
>
>
> for i in range( 1, len( self.sampleNamesList ) ):
>     sampleName = self.sampleNamesList[i]
>
>     # select the key and counts from the sample.
>     qsdf = quantSparkDFList[i]
>     sampleSDF = qsdf\
>         .select( ["Name", "NumReads", ] )\
>         .withColumnRenamed( "NumReads", sampleName )
>
>     sampleSDF.createOrReplaceTempView( "sample" )
>
>     # the sample name must be quoted, else column names with a '-'
>     # like GTEX-1117F-0426-SM-5EGHI will generate an error:
>     # spark thinks the '-' is an expression. '_' is also
>     # a special char for the sql like operator
>     # https://stackoverflow.com/a/63899306/4586180
>     sqlStmt = 'select rc.*, `{}` \n\
>                from rawCounts as rc, sample \n\
>                where rc.Name == sample.Name \n'.format( sampleName )
>
>     rawCountsSDF = self.spark.sql( sqlStmt )
>     rawCountsSDF.createOrReplaceTempView( "rawCounts" )
>
>
>
>
>
> The way I wrote my script, I do a lot of transformations; the first action
> is at the end of the script:
>
> retCountDF.coalesce(1).write.csv( outfileCount, mode='overwrite',
>     header=True )
>
>
>
> Should I be calling self.spark.sql( 'uncache table rawCounts' ) before
> calling rawCountsSDF.createOrReplaceTempView( "rawCounts" )? I expected
> Spark to manage the cache automatically, given that I do not explicitly
> call cache().
>
>
>
>
>
> Why do I not get a similar warning from the following?
>
> sampleSDF.createOrReplaceTempView( "sample" )
>
>
>
> Will this reduce my memory requirements?
>
>
>
> Kind regards
>
>
>
> Andy
>


-- 
Jun Zhu
Sr. Engineer I, Data
+86 18565739171
Units 3801, 3804, 38F, C Block, Beijing Yintai Center, Beijing, China