Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-28 Thread Gera Shegalov
I agree with the previous answers that (if requirements allow it) it is
much easier to just orchestrate a copy either in the same app or sync
externally.

A long time ago, and not for a Spark app, we solved a similar use case via
Nfly mount points:
https://hadoop.apache.org/docs/r3.2.3/hadoop-project-dist/hadoop-hdfs/ViewFs.html#Multi-Filesystem_I.2F0_with_Nfly_Mount_Points
It may work with Spark because it sits underneath the FileSystem API ...
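
To make the "copy in the same app" route concrete, here is a minimal Scala
sketch (paths, the bucket name, and the MinIO endpoint/credentials are
placeholders; a DataFrame df and a SparkSession spark are assumed to be in
scope):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

// Write once to HDFS with Spark, then copy the finished output to MinIO
// through the Hadoop FileSystem API, all in the same application.
val hdfsOut = new Path("hdfs:///tmp/myapp/output")
df.write.mode("overwrite").parquet(hdfsOut.toString)

val conf = spark.sparkContext.hadoopConfiguration
// MinIO is reached through the s3a connector; endpoint and credentials
// below are illustrative only.
conf.set("fs.s3a.endpoint", "http://minio:9000")
conf.set("fs.s3a.path.style.access", "true")
conf.set("fs.s3a.access.key", "...")
conf.set("fs.s3a.secret.key", "...")

val srcFs = hdfsOut.getFileSystem(conf)
val dstFs = FileSystem.get(new URI("s3a://my-bucket/"), conf)
// Simple sequential copy on the driver; deleteSource = false keeps the HDFS copy.
FileUtil.copy(srcFs, hdfsOut, dstFs, new Path("s3a://my-bucket/myapp/output"), false, conf)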



On Tue, May 21, 2024 at 10:03 PM Prem Sahoo  wrote:

> I am looking for a writer/committer optimization which can make the Spark
> write faster.
>
> On Tue, May 21, 2024 at 9:15 PM eab...@163.com  wrote:
>
>> Hi,
>> I think you should write to HDFS, then copy the files (Parquet or ORC)
>> from HDFS to MinIO.
>>
>> --
>> eabour
>>
>>
>> *From:* Prem Sahoo 
>> *Date:* 2024-05-22 00:38
>> *To:* Vibhor Gupta ; user
>> 
>> *Subject:* Re: EXT: Dual Write to HDFS and MinIO in faster way
>>
>>
>> On Tue, May 21, 2024 at 6:58 AM Prem Sahoo  wrote:
>>
>>> Hello Vibhor,
>>> Thanks for the suggestion.
>>> I am looking for other alternatives where the same dataframe can be
>>> written to two destinations without re-execution and without cache
>>> or persist.
>>>
>>> Can someone help me with scenario 2?
>>> How can I make Spark write to MinIO faster?
>>> Sent from my iPhone
>>>
>>> On May 21, 2024, at 1:18 AM, Vibhor Gupta 
>>> wrote:
>>>
>>> 
>>>
>>> Hi Prem,
>>>
>>>
>>>
>>> You can try to write to HDFS then read from HDFS and write to MinIO.
>>>
>>>
>>>
>>> This will prevent duplicate transformation.
>>>
>>>
>>>
>>> You can also try persisting the dataframe using the DISK_ONLY level.
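>>>
>>> A rough sketch of both suggestions (paths, the s3a scheme for MinIO, and
>>> bucket names are placeholders):
>>>
>>> // 1) Write once to HDFS, then re-read the materialized files and write them out
>>> df.write.mode("overwrite").parquet("hdfs:///tmp/myapp/out")
>>> spark.read.parquet("hdfs:///tmp/myapp/out")
>>>   .write.mode("overwrite").parquet("s3a://my-bucket/myapp/out")
>>>
>>> // 2) Or persist the dataframe to local disk and write it twice
>>> import org.apache.spark.storage.StorageLevel
>>> df.persist(StorageLevel.DISK_ONLY)
>>> df.write.mode("overwrite").parquet("hdfs:///tmp/myapp/out")
>>> df.write.mode("overwrite").parquet("s3a://my-bucket/myapp/out")
>>> df.unpersist()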
>>>
>>>
>>>
>>> Regards,
>>>
>>> Vibhor
>>>
>>> *From: *Prem Sahoo 
>>> *Date: *Tuesday, 21 May 2024 at 8:16 AM
>>> *To: *Spark dev list 
>>> *Subject: *EXT: Dual Write to HDFS and MinIO in faster way
>>>
>>> *EXTERNAL: *Report suspicious emails to *Email Abuse.*
>>>
>>> Hello Team,
>>>
>>> I am planning to write to two datasources at the same time.
>>>
>>>
>>>
>>> Scenario:-
>>>
>>>
>>>
>>> Writing the same dataframe to HDFS and MinIO without re-executing the
>>> transformations and without cache(). How can we make it faster?
>>>
>>>
>>>
>>> Read the Parquet file, do a few transformations, and write to HDFS and
>>> MinIO.
>>>
>>>
>>>
>>> Here, for both writes, Spark needs to execute the transformations again. Do we
>>> know how we can avoid re-executing the transformations without
>>> cache()/persist()?
>>>
>>>
>>>
>>> Scenario 2:-
>>>
>>> I am writing 3.2 GB of data to HDFS and MinIO, which takes ~6 minutes.
>>>
>>> Do we have any way to make this write faster?
>>>
>>>
>>>
>>> I don't want to repartition and write, as repartition will have the
>>> overhead of shuffling.
>>>
>>>
>>>
>>> Please provide some inputs.
>>>
>>>
>>>
>>>
>>>
>>>


Re: error trying to save to database (Phoenix)

2023-08-22 Thread Gera Shegalov
If you look at the dependencies of the 5.0.0-HBase-2.0 artifact
https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark/5.0.0-HBase-2.0
you will see it was built against Spark 2.3.0 and Scala 2.11.8.

You may need to check with the Phoenix community whether your setup with Spark
3.4.1 etc. is supported by something like
https://github.com/apache/phoenix-connectors/tree/master/phoenix5-spark3
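
If it helps to double-check the Scala side of the mismatch, you can ask the
running Spark JVM directly from a spark-shell; a tiny sketch (the values in
the comments are only examples):

// Scala version the Spark build ships with, and the Spark version itself;
// the connector artifact you pick has to match both.
println(scala.util.Properties.versionString) // e.g. version 2.12.17
println(org.apache.spark.SPARK_VERSION)      // e.g. 3.4.1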



On Mon, Aug 21, 2023 at 6:12 PM Kal Stevens  wrote:

> Sorry for being so dense, and thank you for your help.
>
> I was using this version
> phoenix-spark-5.0.0-HBase-2.0.jar
>
> Because it was the latest in this repo
> https://mvnrepository.com/artifact/org.apache.phoenix/phoenix-spark
>
>
> On Mon, Aug 21, 2023 at 5:07 PM Sean Owen  wrote:
>
>> It is. But you have a third party library in here which seems to require
>> a different version.
>>
>> On Mon, Aug 21, 2023, 7:04 PM Kal Stevens  wrote:
>>
>>> OK, it was my impression that Scala was packaged with Spark to avoid a
>>> mismatch
>>> https://spark.apache.org/downloads.html
>>>
>>> It looks like Spark 3.4.1 (my version) uses Scala 2.12
>>> How do I specify the Scala version?
>>>
>>> On Mon, Aug 21, 2023 at 4:47 PM Sean Owen  wrote:
>>>
 That's a mismatch between the version of Scala that your library uses and
 the version Spark uses.

 On Mon, Aug 21, 2023, 6:46 PM Kal Stevens 
 wrote:

> I am having a hard time figuring out what I am doing wrong here.
> I am not sure if I have an incompatible version of something installed
> or something else.
> I cannot find anything relevant on Google to figure out what I am
> doing wrong.
> I am using *spark 3.4.1* and *python3.10*
>
> This is my code to save my dataframe:
> urls = []
> pull_sitemap_xml(robot, urls)
> df = spark.createDataFrame(data=urls, schema=schema)
> df.write.format("org.apache.phoenix.spark") \
>     .mode("overwrite") \
>     .option("table", "property") \
>     .option("zkUrl", "192.168.1.162:2181") \
>     .save()
>
> urls is an array of maps, containing a "url" and a "last_mod" field.
>
> Here is the error that I am getting
>
> Traceback (most recent call last):
>
>   File "/home/kal/real-estate/pullhttp/pull_properties.py", line 65,
> in main
>
> .save()
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py",
> line 1396, in save
>
> self._jwrite.save()
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py",
> line 1322, in __call__
>
> return_value = get_return_value(
>
>   File
> "/hadoop/spark/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py",
> line 169, in deco
>
> return f(*a, **kw)
>
>   File
> "/hadoop/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
> line 326, in get_return_value
>
> raise Py4JJavaError(
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o636.save.
>
> : java.lang.NoSuchMethodError: 'scala.collection.mutable.ArrayOps
> scala.Predef$.refArrayOps(java.lang.Object[])'
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.getFieldArray(DataFrameFunctions.scala:76)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:35)
>
> at
> org.apache.phoenix.spark.DataFrameFunctions.saveToPhoenix(DataFrameFunctions.scala:28)
>
> at
> org.apache.phoenix.spark.DefaultSource.createRelation(DefaultSource.scala:47)
>
> at
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:47)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
>



Re: Looping through a series of telephone numbers

2023-04-03 Thread Gera Shegalov
+1 to using a UDF.  E.g., TransmogrifAI uses
libphonenumber https://github.com/google/libphonenumber that normalizes
phone numbers to a tuple (country code, national number), so you can find
more sophisticated matches for phones written in different notations.

If you simplify it for DataFrame/SQL-only use, you can create a Scala UDF:

$SPARK_HOME/bin/spark-shell --packages
com.googlecode.libphonenumber:libphonenumber:8.13.9
scala> :paste
// Entering paste mode (ctrl-D to finish)

import com.google.i18n.phonenumbers._
import scala.collection.JavaConverters._
val extractPhonesUDF = udf((x: String) =>
  PhoneNumberUtil.getInstance()
    .findNumbers(x, "US").asScala.toSeq
    .map(x => (x.number.getCountryCode, x.number.getNationalNumber)))
spark.udf.register("EXTRACT_PHONES", extractPhonesUDF)
sql("""
SELECT
  EXTRACT_PHONES('+496811234567,+1(415)7654321') AS needles,
  EXTRACT_PHONES('Call our HQ in Germany at (+49) 0681/1234567, in Paris at
: +33 01 12 34 56 78, or the SF office at 415-765-4321') AS haystack,
  ARRAY_INTERSECT(needles, haystack) AS needles_in_haystack
""").show(truncate=false)

// Exiting paste mode, now interpreting.

+-----------------------------------+----------------------------------------------------+-----------------------------------+
|needles                            |haystack                                            |needles_in_haystack                |
+-----------------------------------+----------------------------------------------------+-----------------------------------+
|[{49, 6811234567}, {1, 4157654321}]|[{49, 6811234567}, {33, 112345678}, {1, 4157654321}]|[{49, 6811234567}, {1, 4157654321}]|
+-----------------------------------+----------------------------------------------------+-----------------------------------+

On Sun, Apr 2, 2023 at 7:18 AM Sean Owen  wrote:

> That won't work, you can't use Spark within Spark like that.
> If it were exact matches, the best solution would be to load both datasets
> and join on telephone number.
> For this case, I think your best bet is a UDF that contains the telephone
> numbers as a list and decides whether a given number matches something in
> the set. Then use that to filter, then work with the data set.
> There are probably clever, fast ways of efficiently determining whether a string
> is a prefix of a group of strings in Python that you could use too.
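>
> A minimal Scala sketch of that idea (column names and paths are placeholders):
>
> import org.apache.spark.sql.functions.{col, udf}
> import spark.implicits._
>
> // Load the ~40,000 numbers once and broadcast them to the executors.
> val tels = spark.read.option("header", "true").csv("/path/to/tels.csv")
>   .select("tel").as[String].collect().toSet
> val telsBc = spark.sparkContext.broadcast(tels)
>
> // Keep a row if any known number is a suffix of the column value, which
> // tolerates a leading prefix such as "+" in the data set.
> val matchesTel = udf((s: String) => s != null && telsBc.value.exists(t => s.endsWith(t)))
> val hits = spark.read.parquet("/path/to/telephonedirectory")
>   .filter(matchesTel(col("phone")))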
>
> On Sun, Apr 2, 2023 at 3:17 AM Philippe de Rochambeau 
> wrote:
>
>> Many thanks, Mich.
>> Is "foreach" the best construct to look up items in a dataset such as
>> the "telephonedirectory" data set below?
>>
>> val telrdd = spark.sparkContext.parallelize(Seq("tel1", "tel2", "tel3", ...)) // the telephone sequence
>> // was read from a CSV file
>>
>> val ds = spark.read.parquet("/path/to/telephonedirectory")
>>
>>   rdd.foreach(tel => {
>>     longAcc.select("*").rlike("+" + tel)
>>   })
>>
>>
>>
>>
>> On 1 Apr 2023, at 22:36, Mich Talebzadeh  wrote:
>>
>> This may help
>>
>> Spark rlike() Working with Regex Matching Examples
>> Mich Talebzadeh,
>> Lead Solutions Architect/Engineering Lead
>> Palantir Technologies Limited
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 1 Apr 2023 at 19:32, Philippe de Rochambeau 
>> wrote:
>>
>>> Hello,
>>> I’m looking for an efficient way in Spark to search for a series of
>>> telephone numbers, contained in a CSV file, in a data set column.
>>>
>>> In pseudo code,
>>>
>>> for tel in [tel1, tel2, ..., tel40,000]
>>>     search for tel in dataset using .like("%tel%")
>>> end for
>>>
>>> I’m using the like function because the telephone numbers in the data
>>> set main contain prefixes, such as « + « ; e.g., « +331222 ».
>>>
>>> Any suggestions would be welcome.
>>>
>>> Many thanks.
>>>
>>> Philippe
>>>
>>>
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>


Re: [Building] Building with JDK11

2022-07-18 Thread Gera Shegalov
Bytecode version is controlled by the javac "-target" option for Java, and by
the scalac "-target:" option for Scala.
The JDK can cross-compile between known versions.

Spark uses 1.8 as source and target by default, controlled by the Maven
property java.version. But it is also hard-coded with -target:jvm-1.8
for Scala. Higher JDK versions can run lower-version bytecode.

If you want to try 11, replace occurrences of -target:jvm-1.8 with
-target:jvm-${java.version} in pom.xml; you should then be able to produce Java 11
bytecode by adding -Djava.version=11 to your Maven build command.

./build/mvn -Djava.version=11  ...

However, I did not try anything beyond a quick compile of core and cannot say
anything about the fallout at run time.
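
For reference, the class-file check Szymon did with javap -v can also be done
programmatically; a small Scala sketch (the example path at the bottom is
hypothetical):

import java.io.DataInputStream
import java.nio.file.{Files, Paths}

// Reads the class-file major version: 52 = Java 8, 55 = Java 11.
def classMajorVersion(path: String): Int = {
  val in = new DataInputStream(Files.newInputStream(Paths.get(path)))
  try {
    in.skipBytes(4)        // skip the 0xCAFEBABE magic
    in.readUnsignedShort() // minor version (ignored)
    in.readUnsignedShort() // major version
  } finally in.close()
}

// e.g. classMajorVersion("/tmp/extracted/org/apache/spark/SparkContext.class")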

On Mon, Jul 18, 2022 at 3:49 PM Szymon Kuryło 
wrote:

> Ok, so I used a docker image `adoptopenjdk/openjdk11:latest` as a builder,
> and still got version 52 classes.
>
> On Mon, 18 Jul 2022 at 09:51, Stephen Coy  wrote:
>
>> Hi Sergey,
>>
>> I’m willing to be corrected, but I think the use of a JAVA_HOME
>> environment variable was something that was started by and continues to be
>> perpetuated by Apache Tomcat.
>>
>> … or maybe Apache Ant, but modern versions of Ant do not need it either.
>>
>> It is not needed for modern releases of Apache Maven.
>>
>> Cheers,
>>
>> Steve C
>>
>> On 18 Jul 2022, at 4:12 pm, Sergey B.  wrote:
>>
>> Hi Steve,
>>
>> Can you shed some light on why they need $JAVA_HOME at all if
>> everything is already in place?
>>
>> Regards,
>> - Sergey
>>
>> On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy <
>> s...@infomedia.com.au.invalid> wrote:
>>
>>> Hi Szymon,
>>>
>>> There seems to be a common misconception that setting JAVA_HOME will set
>>> the version of Java that is used.
>>>
>>> This is not true, because in most environments you need to have a PATH
>>> environment variable set up that points at the version of Java that you
>>> want to use.
>>>
>>> You can set JAVA_HOME to anything at all and `java -version` will always
>>> return the same result.
>>>
>>> The way that you configure PATH varies from OS to OS:
>>>
>>>
>>>    - On MacOS, use `/usr/libexec/java_home -v11`
>>>    - On Linux, use `sudo alternatives --config java`
>>>    - On Windows I have no idea
>>>
>>>
>>> When you do this, the `mvn` command will compute the value of JAVA_HOME
>>> for its own use; there is no need to explicitly set it yourself (unless
>>> other Java applications that you use insist upon it).
>>>
>>>
>>> Cheers,
>>>
>>> Steve C
>>>
>>> On 16 Jul 2022, at 7:24 am, Szymon Kuryło 
>>> wrote:
>>>
>>> Hello,
>>>
>>> I'm trying to build a Java 11 Spark distro using the
>>> dev/make-distribution.sh script.
>>> I have set JAVA_HOME to point to the JDK 11 location, and I've also set the
>>> java.version property in pom.xml to 11, effectively also setting
>>> `maven.compile.source` and `maven.compile.target`.
>>> When inspecting classes from the `dist` directory with `javap -v`, I
>>> find that the class major version is 52, which is specific to JDK8. Am
>>> I missing something? Is there a reliable way to set the JDK used in the
>>> build process?
>>>
>>> Thanks,
>>> Szymon K.
>>>
>>>
>>> This email contains confidential information of and is the copyright of
>>> Infomedia. It must not be forwarded, amended or disclosed without consent
>>> of the sender. If you received this message by mistake, please advise the
>>> sender and delete all copies. Security of transmission on the internet
>>> cannot be guaranteed, could be infected, intercepted, or corrupted and you
>>> should ensure you have suitable antivirus protection in place. By sending
>>> us your or any third party personal details, you consent to (or confirm you
>>> have obtained consent from such third parties) to Infomedia’s privacy
>>> policy. http://www.infomedia.com.au/privacy-policy/
>>>
>>
>>