Unsubscribe

2021-08-19 Thread Atlas - Samir Souidi
Unsubscribe




Unsubscribe

2021-08-19 Thread Sandeep Patra
Unsubscribe


Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Pedro Tuero
Hi, I'm sorry, the problem was really silly: in the test the number of
partitions was zero (it was a division of the original number of
partitions of the source RDD, and in the test that number was just one),
and that's why the test was failing.
Anyway, the behavior is arguably surprising; I would expect repartitioning
to zero partitions to be disallowed, or at least warned about, instead of
silently discarding all the data.
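
For illustration, a minimal Scala sketch of that behavior (the Java
aggregateByKey delegates to the same Scala implementation); the data and
names here are made up, not the original test:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setMaster("local[1]").setAppName("zero-partitions-sketch"))

// One input partition, as in the test described above.
val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)), numSlices = 1)

// Integer division silently drops the partition count to zero: 1 / 2 == 0.
val numPartitions = pairs.getNumPartitions / 2

// With zero partitions the shuffled RDD has no partitions at all, so no tasks
// run and the result is simply empty (no error, no warning).
val aggregated = pairs.aggregateByKey(0, numPartitions)(_ + _, _ + _)
println(aggregated.collect().isEmpty)   // true when numPartitions == 0

sc.stop()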

Thanks for your time!
Regards,
Pedro

On Thu, Aug 19, 2021 at 07:42, Jacek Laskowski (ja...@japila.pl)
wrote:

> Hi Pedro,
>
> No idea what might be causing it. Do you perhaps have some code to
> reproduce it locally?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> "The Internals Of" Online Books 
> Follow me on https://twitter.com/jaceklaskowski
>
> 
>
>
> On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero  wrote:
>
>>
>> Context: spark-core_2.12-3.1.1
>> Testing with maven and eclipse.
>>
>> I'm modifying a project and a test stopped working as expected.
>> The difference is in the parameters passed to the aggregateByKey function
>> of JavaPairRDD.
>>
>> JavaSparkContext is created this way:
>> new JavaSparkContext(new SparkConf()
>> .setMaster("local[1]")
>> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"));
>> Then I construct a JavaPairRDD using sparkContext.parallelizePairs and
>> call a method which runs an aggregateByKey over the input JavaPairRDD and
>> tests that the result is as expected.
>>
>> When I use JavaPairRDD line 369 (calling .aggregateByKey(zeroValue,
>> combiner, merger)):
>>
>>   def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U],
>>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>>     implicit val ctag: ClassTag[U] = fakeClassTag
>>     fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
>>   }
>>
>> the test works as expected.
>> But when I use JavaPairRDD line 355 (calling .aggregateByKey(zeroValue,
>> partitions, combiner, merger)):
>>
>>   def aggregateByKey[U](zeroValue: U, numPartitions: Int, seqFunc: JFunction2[U, V, U],
>>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>>     implicit val ctag: ClassTag[U] = fakeClassTag
>>     fromRDD(rdd.aggregateByKey(zeroValue, numPartitions)(seqFunc, combFunc))
>>   }
>>
>> the result is always empty. It looks like there is a problem with the
>> HashPartitioner created in PairRDDFunctions:
>>
>>   def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
>>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>>     aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
>>   }
>>
>> vs:
>>
>>   def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
>>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>>     aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
>>   }
>> I can't debug it properly with Eclipse; the error occurs while the threads
>> are in Spark code (the system editor can only open file-based resources).
>>
>> Does anyone know how to resolve this issue?
>>
>> Thanks in advance,
>> Pedro.
>>
>>
>>
>>


Re: spark-submit not running on macbook pro

2021-08-19 Thread Artemis User
Looks like PySpark can't launch a JVM in the backend.  How did you set
up Java and Spark on your machine?  Some suggestions that may help solve
your issue:


1. Use OpenJDK instead of Apple's JDK, since Spark was developed with
   OpenJDK, not Apple's.  You can use Homebrew to install OpenJDK (I
   don't see any reason why you would need Apple's JDK unless you are
   using the latest Mac; see the question below).
2. Download and deploy the Spark tarball directly from Spark's web site
   and run Spark's examples from the command line to test your
   environment before integrating with PyCharm.

My question to the group: has anyone had any luck with Apple's JDK when
running Spark or other applications (performance-wise)? Is this the one
with native libs for the M1 chipset?


-- ND


On 8/17/21 1:56 AM, karan alang wrote:


Hello Experts,

I'm trying to run spark-submit on my MacBook Pro (from the command line or
using PyCharm), and it gives this error:


Exception: Java gateway process exited before sending its port number

I've tried setting the environment variables in the program (based on
recommendations found online), but the problem still remains.


Any pointers on how to resolve this issue?

# explicitly setting environment variables
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/applejdk-11.0.7.10.1.jdk/Contents/Home"
os.environ["PYTHONPATH"] = "/usr/local/Cellar/apache-spark/3.1.2/libexec//python/lib/py4j-0.10.4-src.zip:/usr/local/Cellar/apache-spark/3.1.2/libexec//python/:"

os.environ["PYSPARK_SUBMIT_ARGS"] = "--master local[2] pyspark-shell"

Traceback (most recent call last):
  File "", line 1, in 
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_bundle/pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "/Applications/PyCharm CE.app/Contents/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/karanalang/Documents/Technology/StructuredStreamin_Udemy/Spark-Streaming-In-Python-master/00-HelloSparkSQL/HelloSparkSQL.py", line 12, in 
    spark = SparkSession.builder.master("local[*]").getOrCreate()
  File "/Users/karanalang/.conda/envs/PythonLeetcode/lib/python3.9/site-packages/pyspark/sql/session.py", line 228, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/karanalang/.conda/envs/PythonLeetcode/lib/python3.9/site-packages/pyspark/context.py", line 384, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/karanalang/.conda/envs/PythonLeetcode/lib/python3.9/site-packages/pyspark/context.py", line 144, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "/Users/karanalang/.conda/envs/PythonLeetcode/lib/python3.9/site-packages/pyspark/context.py", line 331, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "/Users/karanalang/.conda/envs/PythonLeetcode/lib/python3.9/site-packages/pyspark/java_gateway.py", line 108, in launch_gateway
    raise Exception("Java gateway process exited before sending its port number")

Exception: Java gateway process exited before sending its port number




[Announce] Spark on MR3

2021-08-19 Thread Sungwoo Park
Hi Spark users,

We would like to announce the release of Spark on MR3, which is Apache
Spark using MR3 as the execution backend. MR3 is a general purpose
execution engine for Hadoop and Kubernetes, and Hive on MR3 has been its
main application. Spark on MR3 is a new application of MR3.

The main motivation for developing Spark on MR3 is to allow multiple Spark
applications to share compute resources such as Yarn containers or
Kubernetes Pods. It can be particularly useful in cloud environments where
Spark applications are created and destroyed frequently. We wrote a blog
article introducing Spark on MR3:

https://www.datamonad.com/post/2021-08-18-spark-mr3/

Currently we have released Spark 3.0.3 on MR3 1.3. For more details on
Spark on MR3, you can check out the user guide:

https://mr3docs.datamonad.com/docs/spark/

Thanks,

--- Sungwoo


Re: Java : Testing RDD aggregateByKey

2021-08-19 Thread Jacek Laskowski
Hi Pedro,

No idea what might be causing it. Do you perhaps have some code to
reproduce it locally?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
"The Internals Of" Online Books 
Follow me on https://twitter.com/jaceklaskowski




On Tue, Aug 17, 2021 at 4:14 PM Pedro Tuero  wrote:

>
> Context: spark-core_2.12-3.1.1
> Testing with maven and eclipse.
>
> I'm modifying a project and a test stopped working as expected.
> The difference is in the parameters passed to the aggregateByKey function
> of JavaPairRDD.
>
> JavaSparkContext is created this way:
> new JavaSparkContext(new SparkConf()
> .setMaster("local[1]")
> .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"));
> Then I construct a JavaPairRDD using sparkContext.parallelizePairs and
> call a method which runs an aggregateByKey over the input JavaPairRDD and
> tests that the result is as expected.
>
> When I use JavaPairRDD line 369 (calling .aggregateByKey(zeroValue,
> combiner, merger)):
>
>   def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U],
>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>     implicit val ctag: ClassTag[U] = fakeClassTag
>     fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
>   }
>
> the test works as expected.
> But when I use JavaPairRDD line 355 (calling .aggregateByKey(zeroValue,
> partitions, combiner, merger)):
>
>   def aggregateByKey[U](zeroValue: U, numPartitions: Int, seqFunc: JFunction2[U, V, U],
>       combFunc: JFunction2[U, U, U]): JavaPairRDD[K, U] = {
>     implicit val ctag: ClassTag[U] = fakeClassTag
>     fromRDD(rdd.aggregateByKey(zeroValue, numPartitions)(seqFunc, combFunc))
>   }
>
> the result is always empty. It looks like there is a problem with the
> HashPartitioner created in PairRDDFunctions:
>
>   def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U,
>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>     aggregateByKey(zeroValue, new HashPartitioner(numPartitions))(seqOp, combOp)
>   }
>
> vs:
>
>   def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
>       combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
>     aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
>   }
> I can't debug it properly with Eclipse; the error occurs while the threads
> are in Spark code (the system editor can only open file-based resources).
>
> Does anyone know how to resolve this issue?
>
> Thanks in advance,
> Pedro.
>
>
>
>


How can I LoadIncrementalHFiles

2021-08-19 Thread igyu
DF.toJavaRDD.rdd.hbaseBulkLoadThinRows(hbaseContext,
  TableName.valueOf(config.getString("table")), R => {
    val rowKey = Bytes.toBytes(R.getAs[String](name))
    val family = Bytes.toBytes(_family)
    val qualifier = Bytes.toBytes(name)
    val value: Array[Byte] = Bytes.toBytes(R.getAs[String](name))
    familyQualifiersValues += (family, qualifier, value)

    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, config.getString("tmp"))

val table =
  connection.getTable(TableName.valueOf(config.getString("table")))
val load = new LoadIncrementalHFiles(conf)
load.doBulkLoad(new Path(config.getString("tmp")),
  connection.getAdmin, table,
  connection.getRegionLocator(TableName.valueOf(config.getString("table"))))
}

I get an error:

21/08/19 15:12:22 INFO LoadIncrementalHFiles: Split occurred while grouping HFiles, retry attempt 9 with 1 files remaining to group or split
21/08/19 15:12:22 INFO LoadIncrementalHFiles: Trying to load hfile=file:/d:/tmp/f/bb4706276d5d40c5b3014cc74dc39ddd first=Optional[0001] last=Optional[0003]
21/08/19 15:12:22 WARN LoadIncrementalHFiles: Attempt to bulk load region containing  into table sparktest1 with files [family:f path:file:/d:/tmp/f/bb4706276d5d40c5b3014cc74dc39ddd] failed.  This is recoverable and they will be retried.
21/08/19 15:12:22 INFO LoadIncrementalHFiles: Split occurred while grouping HFiles, retry attempt 10 with 1 files remaining to group or split
21/08/19 15:12:22 ERROR LoadIncrementalHFiles: 
-
Bulk load aborted with some files not yet loaded:
-
  file:/d:/tmp/f/bb4706276d5d40c5b3014cc74dc39ddd

Exception in thread "main" java.io.IOException: Retry attempted 10 times without completing, bailing out
    at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.performBulkLoad(LoadIncrementalHFiles.java:419)
    at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:342)
    at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:256)
    at com.join.hbase.writer.HbaseWriter.saveTo(HbaseWriter.scala:167)
    at com.join.Synctool$.main(Synctool.scala:587)
    at com.join.Synctool.main(Synctool.scala)



file:/d:/tmp/f/bb4706276d5d40c5b3014cc74dc39ddd does exist,

so the hbaseBulkLoadThinRows function is OK.

On the official web site I found:

rdd.hbaseBulkLoad(TableName.valueOf(tableName),
  t => {
   val rowKey = t._1
   val family:Array[Byte] = t._2(0)._1
   val qualifier = t._2(0)._2
   val value = t._2(0)._3
   val keyFamilyQualifier= new KeyFamilyQualifier(rowKey, family, qualifier)
   Seq((keyFamilyQualifier, value)).iterator
  },
  stagingFolder.getPath)
val load = new LoadIncrementalHFiles(config)
load.doBulkLoad(new Path(stagingFolder.getPath),
  conn.getAdmin, table, conn.getRegionLocator(TableName.valueOf(tableName)))

The path passed to hbaseBulkLoad and to LoadIncrementalHFiles is the same:

stagingFolder.getPath

and my hbaseBulkLoad output is likewise a local file.
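
For reference, a minimal sketch of the thin-rows bulk load pattern based on
the hbase-spark API used above. It is only an illustration, not the original
code: the SparkSession/DataFrame, column names, table name, and staging path
are assumptions.

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.spark.{ByteArrayWrapper, FamiliesQualifiersValues, HBaseContext}
import org.apache.hadoop.hbase.spark.HBaseRDDFunctions._
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.util.Bytes

// Assumes an existing SparkSession `spark` and a DataFrame `df` with a string
// row-key column plus some string value columns (names below are illustrative).
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val hbaseContext = new HBaseContext(spark.sparkContext, conf)

val tableName = TableName.valueOf("sparktest1")
val family = Bytes.toBytes("f")
val columns = Seq("c1", "c2")         // illustrative column names
val stagingDir = "/tmp/bulkload"      // illustrative staging directory

// Thin-rows variant: build one (rowKey, FamiliesQualifiersValues) pair per row.
df.rdd.hbaseBulkLoadThinRows(hbaseContext, tableName, row => {
  val rowKey = Bytes.toBytes(row.getAs[String]("rowkey"))
  val fqv = new FamiliesQualifiersValues
  columns.foreach { c =>
    fqv += (family, Bytes.toBytes(c), Bytes.toBytes(row.getAs[String](c)))
  }
  (new ByteArrayWrapper(rowKey), fqv)
}, stagingDir)

// Load the generated HFiles from the same staging directory.
val table = connection.getTable(tableName)
val load = new LoadIncrementalHFiles(conf)
load.doBulkLoad(new Path(stagingDir), connection.getAdmin, table,
  connection.getRegionLocator(tableName))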


igyu